Using the Visual Words based on Affine-SIFT Descriptors for Face Recognition

Yu-Shan Wu, Heng-Sung Lu, Gwo-Hwa Ju, Ting-Wei Lee, Yen-Lin Chu
Business Customer Solutions Lab., Chunghwa Telecommunication Laboratories
2, Lane 55, Min-Tsu Road Sec. 5, Yang-Mei, Taoyuan, Taiwan 3260, R.O.C.
{yushanwu, lhs306, jgh, fnas, lews32330}@cht.com.tw

Abstract
Video-based face recognition has drawn a lot of attention in recent years. Meanwhile, the Bag-of-Visual-Words (BoWs) representation has recently been applied successfully in image retrieval and object recognition. In this paper, a video-based face recognition approach which uses visual words is proposed. In classic visual words, Scale Invariant Feature Transform (SIFT) descriptors of an image are first extracted at interest points detected by difference of Gaussian (DoG); k-means-based visual vocabulary generation is then applied to replace these descriptors with the indexes of the closest visual words. However, for facial images, SIFT descriptors are not good enough due to facial pose distortion, facial expression and lighting condition variation. In this paper, we use Affine-SIFT (ASIFT) descriptors as the facial image representation. Experimental results on the UCSD/Honda Video Database and the VidTIMIT Video Database suggest that visual words based on Affine-SIFT descriptors achieve lower error rates in the face recognition task.

Keywords: face recognition, SIFT, Affine-SIFT, visual words.

I. INTRODUCTION

Video-based face recognition has been a popular research topic because of its scientific challenges and the wide use of video monitoring. Many well-known approaches have been proposed to address the face recognition problem. Among them, Principal Component Analysis (PCA) [1] searches for the subspace of the feature space that has the largest variance and then projects the feature vectors onto it. Linear Discriminant Analysis (LDA) [2] attempts to obtain another subspace, one which maximizes the ratio of between-class variance to within-class variance.
Locality Preserving Projection (LPP) [3] likewise tries to find an optimal linear transform that preserves the local neighborhood information of the data set in a certain sense. Recently, methods based on multiple images/video sequences for face recognition have also been proposed. The Mutual Subspace Method (MSM) [4] considers the minimum angle between the input and reference subspaces as the measure of similarity, where each subspace is formed by a PCA operation on the image sequence of each person. The Constrained Mutual Subspace Method (CMSM) [5][6][7] is an improved version of MSM. The construction of the input and reference subspaces is the same as in MSM, except that the bases of these subspaces are further projected onto a constraint subspace, and the projected bases are used to calculate the similarity between two persons. All of the above methods concentrate on the projection or transformation of the feature vector. The feature vector of a face image used by these methods is usually just the gray values in row-major order. However, feature selection and extraction are also extremely important in face recognition. Recently, the Bag-of-Visual-Words (BoWs) [8][9] image representation has been utilized in many computer vision problems and has demonstrated impressive performance. In this method, Scale-Invariant Feature Transform (SIFT) [10] features of an image are first extracted at interest points, which are usually detected by the difference-of-Gaussian (DoG) method. A clustering method is then used to convert these SIFT features into a codeword histogram. Finally, the degree of similarity between two images can be measured by the distance between their histograms. Different face images of the same person obtained by a camera at varying positions and angles undergo apparent deformations. These deformations can be alleviated by an affine transform of the image plane. The parameters of the affine transform can be described by scale, rotation, translation, and the camera latitude and longitude angles. Although the SIFT method is invariant to three out of the above five parameters, it is not good enough.
The ASIFT method [11] was proposed to cover all five parameters and has been proved to be fully affine invariant. Furthermore, the computational complexity of the ASIFT method can be reduced to about twice that of the SIFT method by a two-resolution scheme. In this paper, we apply ASIFT Visual Words as the face image representation. Experimental results on the UCSD/Honda [12] Video Database and the VidTIMIT [13] Video Database show that the ASIFT Visual Words method is superior to other classical methods. This paper is organized as follows. In Section II, we introduce three representations for face images: the SIFT method, the ASIFT method, and the proposed method. In Section III, experimental results on the well-known UCSD/Honda and VidTIMIT Video Databases are presented. Finally, the discussion and conclusion are given in Section IV.

II. FACE RECOGNITION APPROACHES

In this section, we introduce the two image representations mentioned in Section I and the proposed method.
In Sections 2.1 and 2.2, the SIFT method and the ASIFT method for face image representation are introduced, respectively. In Section 2.3, the proposed face recognition method is derived. Finally, the performance evaluation of face recognition in video sequences is introduced in Section 2.4.

2.1 Scale Invariant Feature Transform (SIFT)

The SIFT method compares two images under a rotation, a translation and a scale change to decide whether one image can be deduced from the other. To achieve scale invariance, SIFT simulates zoom in scale space. This is accomplished by searching for stable points across all possible scales; these stable points can be considered invariant to scale change. The scale space of an image is formed by the convolution of the image with a variable-scale Gaussian G(x, y, σ) at several scales, where σ is the scale parameter. The convolution result L(x, y, σ) is defined as:

L(x, y, σ) = G(x, y, σ) * I(x, y)   (1)

where * denotes the convolution operation at coordinates (x, y), and

G(x, y, σ) = (1 / (2πσ²)) e^(−(x² + y²) / (2σ²))   (2)

To detect stable keypoints in scale space efficiently, the method proposed by Lowe [10] is used, which convolves the difference-of-Gaussian function with the image. The difference of two nearby scales separated by a constant scale factor c can be computed as:

D(x, y, σ) = (G(x, y, cσ) − G(x, y, σ)) * I(x, y) = L(x, y, cσ) − L(x, y, σ)   (3)

Since the smoothed images L at every scale need to be computed for scale-space feature description in any case, D can be obtained by simple image subtraction. To detect the extrema reliably, an important issue is how to determine the frequency of sampling in the scale and spatial domains. Here we take the settings made by Lowe [10]: 3 scales per octave, with the standard deviation σ of the Gaussian G set to 0.5.
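As an illustration of Eqs. (1)-(3), the following minimal Python sketch (not from the paper; it assumes numpy/scipy, a toy impulse image, and the factor c = 2^(1/3) implied by 3 scales per octave) computes a DoG response by simple image subtraction:

```python
# Sketch of the difference-of-Gaussian computation, Eqs. (1)-(3).
import numpy as np
from scipy.ndimage import gaussian_filter

def dog(image, sigma, c):
    """D(x, y, sigma) = L(x, y, c*sigma) - L(x, y, sigma), by image subtraction."""
    L_lo = gaussian_filter(image.astype(float), sigma)      # L(x, y, sigma)
    L_hi = gaussian_filter(image.astype(float), c * sigma)  # L(x, y, c*sigma)
    return L_hi - L_lo

# Toy example: an impulse image, so D shows the band-pass shape of the DoG kernel
img = np.zeros((32, 32))
img[16, 16] = 1.0
D = dog(img, sigma=1.6, c=2 ** (1 / 3))  # 3 scales per octave -> c = 2^(1/3)
print(D.shape)
```

Because the more-blurred image has a lower central peak, D is negative at the impulse location and positive in a surrounding ring, which is exactly the extremum structure the keypoint detector searches for.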
Setting aside all sampling issues, and applying several thresholds to eliminate unreliable features, the SIFT method computes the scale-space extrema (x, y, σ) of the spatial Laplacian L(x, y, σ) and, for each of these extrema, samples a square image patch centered at (x, y) which has dominant gradients over its neighbors. Because the resulting image patches at scale σ are searched based on gradient direction, the descriptor is invariant to illumination changes. Furthermore, since only local histograms of the gradient direction are kept, the SIFT descriptor is invariant to translation and rotation.

2.2 Affine-SIFT (ASIFT)

Combining the simulation of all zooms out of the query image with the normalization of rotation and translation is the main ingredient of the SIFT method. Based on this idea, the ASIFT method simulates the two camera-axis parameters, the longitude angle and the latitude angle (which is equivalent to tilt), and then applies the SIFT method to simulate scale (zoom out) and to normalize translation and rotation. As with the SIFT method, the sampling frequency must be considered, because exhaustively simulating the whole affine space would be prohibitive and impractical. Furthermore, a two-resolution scheme for comparing the similarity between two images is used to reduce the complexity of ASIFT.

2.2.1 ASIFT algorithm

Step 1: Simulate all possible affine distortions of the query image, where the distortions are caused by the change of the camera optical-axis orientation from a frontal view. The degree of distortion depends on two parameters, the longitude angle φ and the latitude angle θ. For the longitude angle φ, the query image undergoes rotations. For the latitude angle θ, the query image undergoes directional subsampling with the tilt parameter t = 1/|cos θ|, preceded by convolution with a Gaussian of standard deviation k√(t² − 1). The constant value k = 0.8 was settled by Lowe [10].

Step 2: Since computational efficiency must be taken into account, the sampling is performed on a finite number of latitude and longitude angles.

Step 3: All images simulated from the query image are compared by a similarity matching method (SIFT).
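Step 1 above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes numpy/scipy, and it applies the tilt t = 1/cos θ by anti-aliasing blur and subsampling along the x direction only.

```python
# Sketch of one ASIFT affine simulation: rotate by phi, then tilt by t = 1/cos(theta).
import numpy as np
from scipy.ndimage import gaussian_filter1d, rotate

def simulate_tilt(image, phi_deg, theta_deg, k=0.8):
    # Longitude angle phi: in-plane rotation of the query image
    rotated = rotate(image.astype(float), phi_deg, reshape=True)
    t = 1.0 / np.cos(np.radians(theta_deg))  # tilt parameter, theta < 90 degrees
    if t > 1.0:
        # Anti-aliasing: Gaussian blur with std k*sqrt(t^2 - 1) in the x direction,
        # then x-subsampling by the factor t
        blurred = gaussian_filter1d(rotated, k * np.sqrt(t * t - 1), axis=1)
        cols = np.arange(0, blurred.shape[1], t).astype(int)
        return blurred[:, cols]
    return rotated

img = np.random.rand(64, 64)
sim = simulate_tilt(img, phi_deg=30.0, theta_deg=60.0)  # theta = 60 deg -> t = 2
print(sim.shape)
```

With θ = 60° the tilt is t = 2, so the simulated image has roughly half as many columns as rows, mimicking the foreshortening of an oblique camera view.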
2.2.2 Acceleration with a two-resolution scheme for ASIFT

The two-resolution scheme is used to accelerate the computation of the similarity between two images. The main idea of this scheme is to first select the affine transforms that yield good matches at low resolution, then simulate images from the query and the searched images at these selected affine transforms at the original resolution, and finally compute the similarity between these simulated images. The steps of the two-resolution scheme are summarized as follows:

Step 1: Compute the low-resolution images of the query image u and the searched image v by using a Gaussian filter and a downsampling operator. The resulting low-resolution images are defined as:

u′ = P_F G_F u and v′ = P_F G_F v   (4)

where u′ and v′ are the low-resolution images of u and v, respectively, G_F and P_F are the Gaussian filter and the downsampling operator, respectively, and the subindex F represents the size factor of the operator.

Step 2: Apply the ASIFT method to u′ and v′.
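Eq. (4) can be sketched as below. This is a minimal illustration assuming numpy/scipy; the Gaussian width 0.8·F is an assumed anti-aliasing choice, not specified by the source.

```python
# Sketch of the low-resolution step, Eq. (4): u' = P_F G_F u.
import numpy as np
from scipy.ndimage import gaussian_filter

def low_resolution(image, F=3):
    """Gaussian filter G_F followed by the downsampling operator P_F (size factor F)."""
    smoothed = gaussian_filter(image.astype(float), sigma=0.8 * F)  # G_F (assumed width)
    return smoothed[::F, ::F]                                       # P_F: keep every F-th pixel

u = np.random.rand(120, 90)
u_low = low_resolution(u, F=3)
print(u_low.shape)  # (40, 30)
```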
Fig. 1 Matching ability comparison between the SIFT and ASIFT methods.

Step 3: Select the M affine transforms that yield good matches between u′ and v′.

Step 4: Apply the ASIFT method to u and v at the M affine transforms selected in Step 3, and choose the best match among these M transforms as the similarity between u and v.

Fig. 1 shows some face examples comparing the matching ability of the SIFT and ASIFT methods; the left two columns show the matching results for the SIFT method and the right two columns show the matching results for the ASIFT method. From Fig. 1 we can see that when the variation in facial pose and angle of the same person is large, the SIFT method cannot find any match, and across all examples the matching ability of the ASIFT method is clearly superior to that of the SIFT method.

2.3 The proposed method

The face image representation adopted in this paper is ASIFT Visual Words. In this representation, a visual vocabulary must first be generated, by using the hierarchical k-means method to cluster a large number of ASIFT descriptors. We adopt hierarchical k-means here in consideration of both clustering time and clustering efficiency. When the clustering process converges, the centers of all clusters form the visual vocabulary. The ASIFT descriptors of every face image can then be converted into a histogram by using the visual vocabulary. Suppose there are z centers (code words) {C_1, C_2, ..., C_z} in the visual vocabulary and r ASIFT descriptors {A_1, A_2, ..., A_r} in a face image. For each descriptor A_j, 1 ≤ j ≤ r, we calculate the Euclidean distances between A_j and all the centers C_i, 1 ≤ i ≤ z, choose the center with the minimum distance, and record the index of this center as R(A_j). The visual words representation of this image is defined as:

H(i) = Σ_{j=1..r} E_i(j), 1 ≤ i ≤ z,   (5)

where E_i(j) is given by:

E_i(j) = 1 if R(A_j) == i, and E_i(j) = 0 otherwise.   (6)

H is a histogram of length z, and it is the visual words representation of this face.
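The histogram of Eqs. (5)-(6) can be sketched as follows. This illustration assumes numpy and uses toy random descriptors and a flat set of centers; the paper builds the vocabulary with hierarchical k-means, but the nearest-center assignment step is the same.

```python
# Sketch of converting descriptors to a visual-word histogram, Eqs. (5)-(6).
import numpy as np

def visual_word_histogram(descriptors, centers):
    """descriptors: (r, d) array of A_j; centers: (z, d) vocabulary C_i. Returns H of length z."""
    # Euclidean distance from each descriptor A_j to each center C_i
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    R = dists.argmin(axis=1)                      # R(A_j): index of the nearest word
    H = np.bincount(R, minlength=len(centers))    # H(i) = sum_j E_i(j)
    return H

rng = np.random.default_rng(0)
centers = rng.random((8, 128))   # z = 8 toy code words, 128-d like (A)SIFT descriptors
descs = rng.random((50, 128))    # r = 50 descriptors from one face image
H = visual_word_histogram(descs, centers)
print(H.sum())  # 50: each descriptor is assigned to exactly one word
```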
The distance between the visual words representations of two faces can be evaluated by the Bhattacharyya distance [14].

2.4 Performance Evaluation of Face Recognition in Video Sequences

There are many schemes for face classification in video sequences, such as the probabilistic majority voting and Bayes maximum a posteriori schemes proposed in [15]. In both schemes, the similarity between a test image and a video sequence is computed by considering the similarities between this test image and all images in the video sequence. This is not appropriate, since two face images of a person taken under two different facial poses may have low similarity, and this lowers the overall similarity between a test image and a video sequence from the same person. In this paper, we define the similarity between a test image w and a video sequence S as:

Sim(w, S) = max_{s ∈ S} Sim(w, s)   (7)

where s is a face image in the video sequence S. In this definition, among all the similarities between the test image and the face images in a video sequence, only the maximal similarity is used.

III. EXPERIMENTAL RESULTS

In this section, we use the popular UCSD/Honda video database and the VidTIMIT video database to evaluate face recognition performance. The UCSD/Honda video database contains 59 video sequences of 20 different people. There are about 300-600 frames in each video, including large pose and expression variations with significantly complex out-of-plane (3-D) head rotations. These 59 video sequences are further divided into a training subset of 20 videos and a testing subset of 39 videos. Among all members of this database, only one person did not record any testing video.
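A minimal sketch of this similarity measure, assuming numpy: here the Bhattacharyya coefficient of the normalized histograms serves as the frame similarity (the Bhattacharyya distance of [14] is a monotone function of this coefficient), and Eq. (7) takes the maximum over the frames of the sequence.

```python
# Sketch: Bhattacharyya coefficient between histograms, and Eq. (7).
import numpy as np

def bhattacharyya_similarity(h1, h2):
    """Bhattacharyya coefficient of two histograms, normalized to distributions."""
    p = h1 / h1.sum()
    q = h2 / h2.sum()
    return np.sum(np.sqrt(p * q))   # in [0, 1]; ~1.0 for identical distributions

def sim_to_sequence(test_hist, sequence_hists):
    """Sim(w, S) = max over frames s in S of Sim(w, s), as in Eq. (7)."""
    return max(bhattacharyya_similarity(test_hist, h) for h in sequence_hists)

h = np.array([4.0, 1.0, 0.0, 5.0])                                 # test-image histogram w
seq = [np.array([1.0, 1.0, 1.0, 1.0]), np.array([4.0, 1.0, 0.0, 5.0])]  # frames of S
print(sim_to_sequence(h, seq))  # ~1.0: the identical frame dominates the maximum
```

Taking the maximum means one well-matching frame in the sequence is enough, which is exactly the robustness-to-pose argument made above.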
TABLE I. EXPERIMENTAL RESULTS ON THE UCSD/HONDA DATABASE

Method              | Recognition Rate
LBP                 | 76.87%
LGBP                | 74.85%
MBLBP               | 75.54%
SIFT Visual Words   | 72.95%
ASIFT Visual Words  | 87.36%

TABLE II. EXPERIMENTAL RESULTS ON THE VIDTIMIT DATABASE

Method              | Recognition Rate
LBP                 | 90.25%
LGBP                | 77.0%
MBLBP               | 85.7%
SIFT Visual Words   | 88.8%
ASIFT Visual Words  | 92.0%

Fig. 2 Sample face images extracted from the UCSD/Honda Video Database.

Fig. 3 Sample face images extracted from the VidTIMIT Video Database.

The VidTIMIT video database contains 43 different people, and each person has 13 video sequences of about 250-500 frames. The first 3 sequences of each person were taken under 4 regular head movements (left, right, up and down), and the last 10 sequences were taken while the subject spoke short sentences. In our experiment, we use the first sequence as the training video and the second and third sequences as testing videos. For all video sequences, the Viola-Jones face detector [16] was first applied to detect the face in each frame; frames in which the detected face position was incorrect were then deleted manually. Fig. 2 and Fig. 3 show some correctly detected face samples from these two corpora. All detected faces are preprocessed by illumination compensation [17]. In our experiment, the first 25 frames of every subject's training face sequence and the first 100 frames of every testing face sequence are used for performance evaluation. The numbers of visual words used for the UCSD/Honda database and the VidTIMIT database are 9000 and 16384, respectively. The proposed ASIFT Visual Words method is compared against four other classical approaches: LBP [18], MBLBP [19], Local Gabor Binary Patterns (LGBP) [20] and SIFT Visual Words. The recognition rates on the UCSD/Honda and VidTIMIT video databases are summarized in Tables I and II, respectively. Table I shows that the proposed method is superior to the other classical approaches. From Table II, the recognition rate of the proposed method is again the highest, but the improvement is less pronounced. The reason is probably that the head movements in this database are regular and the facial poses are similar in the training and testing videos. Furthermore, there is no facial expression change in the first 3 video sequences used in our experiment.

IV. CONCLUSIONS

In this paper, we proposed a face recognition method which uses ASIFT visual words as the image representation. We also proposed a face recognition scheme for video sequences to further improve recognition performance.

REFERENCES
[1] M. Turk and A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[2] K. Etemad and R. Chellappa, Discriminant analysis for recognition of human face images, Journal of the Optical Society of America A, vol. 14, pp. 1724-1733, 1997.
[3] X. F. He and P. Niyogi, Locality preserving projections, Proc. of NIPS'03, 2003.
[4] O. Yamaguchi, K. Fukui and K. Maeda, Face recognition using temporal image sequence, Proc. Int'l Conf. on Automatic Face and Gesture Recognition, pp. 318-323, 1998.
[5] K. Fukui and O. Yamaguchi, Face recognition using multi-viewpoint patterns for robot vision, Symp. of Robotics Research, 2003.
[6] M. Nishiyama, O. Yamaguchi, and K. Fukui, Face recognition with the multiple constrained mutual subspace method, Audio- and Video-Based Biometric Person Authentication, 5th International Conference, 2005.
[7] K. Fukui, B. Stenger, and O. Yamaguchi, A framework for 3D object recognition using the kernel constrained
mutual subspace method, Asian Conference on Computer Vision (ACCV), 2006.
[8] S. Lazebnik, C. Schmid, and J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in Proc. CVPR, pp. 2169-2178, 2006.
[9] F. Perronnin and C. Dance, Fisher kernels on visual vocabularies for image categorization, in Proc. CVPR, pp. 1-8, 2007.
[10] D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis., vol. 60, no. 2, pp. 91-110, 2004.
[11] J. M. Morel and G. Yu, ASIFT: A new framework for fully affine invariant image comparison, SIAM Journal on Imaging Sciences, vol. 2, no. 2, 2009.
[12] K. C. Lee, J. Ho, M. H. Yang and D. Kriegman, Visual tracking and recognition using probabilistic appearance manifolds, Computer Vision and Image Understanding, vol. 99, no. 3, pp. 303-331, 2005.
[13] C. Sanderson and B. C. Lovell, Multi-region probabilistic histograms for robust and scalable identity inference, Lecture Notes in Computer Science (LNCS), vol. 5558, pp. 199-208, 2009.
[14] T. Kailath, The divergence and Bhattacharyya distance measures in signal selection, IEEE Transactions on Communication Technology, vol. 15, no. 1, pp. 52-60, DOI: 10.1109/TCOM.1967.1089532, 1967.
[15] J. See and M. F. A. Fauzi, Neighborhood discriminative manifold projection for face recognition in video, International Conference on Pattern Analysis and Intelligent Robotics (ICPAIR), vol. 1, pp. 3-8, 2011.
[16] P. Viola and M. Jones, Robust real-time face detection, Int. Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[17] N.-S. Vu and A. Caplier, Illumination-robust face recognition using retina modeling, IEEE International Conference on Image Processing (ICIP), pp. 3289-3292, 2009.
[18] T. Ahonen and M. Pietikainen, Face recognition with local binary patterns, in Proceedings of the European Conference on Computer Vision, Prague, Czech Republic, pp. 469-481, 2004.
[19] S. Liao, X. Zhu, Z. Lei, L. Zhang, and S. Z. Li, Learning multi-scale block local binary patterns for face recognition, Lecture Notes in Computer Science (LNCS), vol. 4642, pp. 828-837, 2007.
[20] X. Sun, H. Xu, C. Zhao and J. Yang, Facial expression recognition based on histogram sequence of local Gabor binary patterns, IEEE Conference on Cybernetics and Intelligent Systems, pp. 58-63, 2008.