IEEE SIGNAL PROCESSING LETTERS, VOL. 23, NO. 10, OCTOBER 2016

Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks

Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, Senior Member, IEEE, and Yu Qiao, Senior Member, IEEE

Abstract: Face detection and alignment in unconstrained environments are challenging due to various poses, illuminations, and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this letter, we propose a deep cascaded multitask framework that exploits the inherent correlation between detection and alignment to boost their performance. In particular, our framework leverages a cascaded architecture with three stages of carefully designed deep convolutional networks to predict face and landmark location in a coarse-to-fine manner. In addition, we propose a new online hard sample mining strategy that further improves the performance in practice. Our method achieves superior accuracy over the state-of-the-art techniques on the challenging face detection dataset and benchmark and WIDER FACE benchmarks for face detection, and annotated facial landmarks in the wild benchmark for face alignment, while keeping real-time performance.

Index Terms: Cascaded convolutional neural network (CNN), face alignment, face detection.

Manuscript received April 7, 2016; revised June 12, 2016 and July 31, 2016; accepted August 10, 2016. Date of publication August 26, 2016; date of current version September 9, 2016. This work was supported in part by External Cooperation Program of BIC, in part by Chinese Academy of Sciences (172644KYSB , KYSB ), in part by Shenzhen Research Program under Grant KQCX , Grant JSGG , Grant CXZZ , Grant CYJ , and Grant JCYJ , in part by Guangdong Research Program under Grant 2014B and Grant 2015B , in part by the Natural Science Foundation of Guangdong Province under Grant 2014A , and in part by the Key Laboratory of Human Machine Intelligence-Synergy Systems through the Chinese Academy of Sciences. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Alexandre X. Falcao.

K. Zhang, Z. Li, and Y. Qiao are with Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China (e-mail: kp.zhang@siat.ac.cn; zhifeng.li@siat.ac.cn; yu.qiao@siat.ac.cn).

Z. Zhang is with the Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong (e-mail: zz013@ie.cuhk.edu.hk).

Color versions of one or more of the figures in this letter are available online.

Digital Object Identifier /LSP

IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See standards/publications/rights/index.html for more information.

I. INTRODUCTION

FACE detection and alignment are essential to many face applications, such as face recognition and facial expression analysis. However, the large visual variations of faces, such as occlusions, large pose variations, and extreme lighting, impose great challenges for these tasks in real-world applications.

The cascade face detector proposed by Viola and Jones [2] utilizes Haar-like features and AdaBoost to train cascaded classifiers, which achieves good performance with real-time efficiency. However, quite a few works [1], [3], [4] indicate that this kind of detector may degrade significantly in real-world applications with larger visual variations of human faces, even with more advanced features and classifiers. Besides the cascade structure, Mathias et al. [5]-[7] introduce deformable part models for face detection and achieve remarkable performance. However, they are computationally expensive and usually require expensive annotation in the training stage. Recently, convolutional neural networks (CNNs) have achieved remarkable progress in a variety of computer vision tasks, such as image classification [9] and face recognition [10]. Inspired by the significant successes of deep learning methods in computer vision tasks, several studies utilize deep CNNs for face detection. Yang et al. [11] train deep CNNs for facial attribute recognition to obtain high response in face regions, which further yield candidate windows of faces. However, due to its complex CNN structure, this approach is time costly in practice. Li et al. [19] use cascaded CNNs for face detection, but their approach requires bounding box calibration from face detection at extra computational expense and ignores the inherent correlation between facial landmark localization and bounding box regression.

Face alignment also attracts extensive research interest. Research works in this area can be roughly divided into two categories: regression-based methods [12], [13], [16], and template fitting approaches [7], [14], [15]. Recently, Zhang et al. [22] proposed to use facial attribute recognition as an auxiliary task to enhance face alignment performance using deep CNN.

However, most previous face detection and face alignment methods ignore the inherent correlation between these two tasks. Though several existing works attempt to jointly solve them, there are still limitations in these works. For example, Chen et al. [18] jointly conduct alignment and detection with random forest using features of pixel value difference, but these handcrafted features limit its performance considerably. Zhang et al. [20] use multitask CNN to improve the accuracy of multiview face detection, but the detection recall is limited by the initial detection window produced by a weak face detector.

On the other hand, mining hard samples in training is critical to strengthen the power of the detector. However, traditional hard sample mining is usually performed in an offline manner, which significantly increases the manual operations. It is desirable to design an online hard sample mining method for face detection, which is adaptive to the current training status automatically.

In this letter, we propose a new framework to integrate these two tasks using unified cascaded CNNs by multitask learning. The proposed CNNs consist of three stages. In the first stage, it produces candidate windows quickly through a shallow CNN. Then, it refines the windows by rejecting a large number of nonface windows through a more complex CNN. Finally, it uses a more powerful CNN to refine the result again and output five facial landmark positions. Thanks to this multitask learning framework, the performance of the algorithm can be notably improved. The major contributions of this letter are summarized as follows:

1) We propose a new cascaded CNNs-based framework for joint face detection and alignment, and carefully design lightweight CNN architecture for real-time performance.
2) We propose an effective method to conduct online hard sample mining to improve the performance.
3) Extensive experiments are conducted on challenging benchmarks to show significant performance improvement of the proposed approach compared to the state-of-the-art techniques in both face detection and face alignment tasks.

II. APPROACH

In this section, we will describe our approach toward joint face detection and alignment.

A. Overall Framework

The overall pipeline of our approach is shown in Fig. 1. Given an image, we initially resize it to different scales to build an image pyramid, which is the input of the following three-stage cascaded framework.

Stage 1: We exploit a fully convolutional network, called proposal network (P-Net), to obtain the candidate facial windows and their bounding box regression vectors. Then candidates are calibrated based on the estimated bounding box regression vectors. After that, we employ nonmaximum suppression (NMS) to merge highly overlapped candidates.

Stage 2: All candidates are fed to another CNN, called refine network (R-Net), which further rejects a large number of false candidates, performs calibration with bounding box regression, and conducts NMS.

Stage 3: This stage is similar to the second stage, but in this stage we aim to identify face regions with more supervision. In particular, the network will output five facial landmark positions.

Fig. 1. Pipeline of our cascaded framework that includes three-stage multitask deep convolutional networks. First, candidate windows are produced through a fast P-Net. After that, we refine these candidates in the next stage through a R-Net. In the third stage, the O-Net produces the final bounding box and facial landmark positions.

B. CNN Architectures

In [19], multiple CNNs have been designed for face detection. However, we notice its performance might be limited by the following facts: 1) some filters in convolution layers lack diversity, which may limit their discriminative ability; 2) compared to other multiclass object detection and classification tasks, face detection is a challenging binary classification task, so it may need fewer filters per layer. To this end, we reduce the number of filters and change the 5 × 5 filter to 3 × 3 filter to reduce the computing, while increasing the depth to get better performance. With these improvements, compared to the previous architecture in [19], we can get better performance with less runtime (the results in training phase are shown in Table I; for fair comparison, we use the same training and validation data in each group). Our CNN architectures are shown in Fig. 2. We apply PReLU [30] as nonlinearity activation function after the convolution and fully connected layers (except output layers).

TABLE I
COMPARISON OF SPEED AND VALIDATION ACCURACY OF OUR CNNS AND PREVIOUS CNNS [19]

Group    CNN           300 × Forward Propagation   Validation Accuracy
Group1   12-Net [19]   s                           94.4%
Group1   P-Net         s                           94.6%
Group2   24-Net [19]   s                           95.1%
Group2   R-Net         s                           95.4%
Group3   48-Net [19]   s                           93.2%
Group3   O-Net         s                           95.4%

C. Training

We leverage three tasks to train our CNN detectors: face/nonface classification, bounding box regression, and facial landmark localization.

1) Face Classification: The learning objective is formulated as a two-class classification problem. For each sample $x_i$, we use the cross-entropy loss as

$$L_i^{det} = -\left( y_i^{det} \log(p_i) + \left(1 - y_i^{det}\right) \log(1 - p_i) \right) \tag{1}$$

where $p_i$ is the probability produced by the network that indicates sample $x_i$ being a face. The notation $y_i^{det} \in \{0, 1\}$ denotes the ground-truth label.

2) Bounding Box Regression: For each candidate window, we predict the offset between it and the nearest ground truth (i.e., the bounding box's left, top, height, and width). The learning objective is formulated as a regression problem, and we employ the Euclidean loss for each sample $x_i$

$$L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2 \tag{2}$$

where $\hat{y}_i^{box}$ is the regression target obtained from the network and $y_i^{box}$ is the ground-truth coordinate. There are four coordinates, including left, top, height, and width, and thus $y_i^{box} \in \mathbb{R}^4$.
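As a rough illustration of the two losses above, the following pure-Python sketch evaluates the standard binary cross-entropy form of Eq. (1) and the squared Euclidean loss of Eq. (2) for single samples. The function names `detection_loss` and `box_loss`, the clipping constant `eps`, and the example numbers are assumptions made for this illustration; they are not taken from the authors' implementation.

```python
import math


def detection_loss(p, y, eps=1e-12):
    """Binary cross-entropy for one sample, the standard form of Eq. (1).

    p is the predicted face probability, y is the label (0 or 1).
    eps is an assumed clipping constant that keeps log() finite.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def box_loss(y_hat, y):
    """Squared Euclidean distance of Eq. (2) between two 4-vectors
    (left, top, height, width)."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y))


# Hypothetical numbers, chosen only to exercise the functions.
face_loss = detection_loss(0.9, 1)       # confident and correct: small loss
nonface_loss = detection_loss(0.9, 0)    # confident but wrong: large loss
offset_loss = box_loss([0.1, 0.0, -0.1, 0.2], [0.0, 0.0, 0.0, 0.0])
```

In this sketch a confident wrong prediction is penalized far more heavily than a confident correct one, which is the behavior the classification objective relies on. The landmark loss of the next subsection has the same squared-distance form, only with ten coordinates instead of four.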

3) Facial Landmark Localization: Similar to the bounding box regression task, facial landmark detection is formulated as a regression problem and we minimize the Euclidean loss as

$$L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2 \tag{3}$$

where $\hat{y}_i^{landmark}$ is the facial landmark's coordinates obtained from the network and $y_i^{landmark}$ is the ground-truth coordinate for the $i$th sample. There are five facial landmarks, including left eye, right eye, nose, left mouth corner, and right mouth corner, and thus $y_i^{landmark} \in \mathbb{R}^{10}$.

4) Multisource Training: Since we employ different tasks in each CNN, there are different types of training images in the learning process, such as face, nonface, and partially aligned face. In this case, some of the loss functions [i.e., (1)-(3)] are not used. For example, for the sample of background region, we only compute $L_i^{det}$, and the other two losses are set as 0. This can be implemented directly with a sample type indicator. Then, the overall learning target can be formulated as

$$\min \sum_{i=1}^{N} \sum_{j \in \{det,\, box,\, landmark\}} \alpha_j \beta_i^j L_i^j \tag{4}$$

where $N$ is the number of training samples and $\alpha_j$ denotes the task importance. We use ($\alpha_{det} = 1$, $\alpha_{box} = 0.5$, $\alpha_{landmark} = 0.5$) in P-Net and R-Net, while ($\alpha_{det} = 1$, $\alpha_{box} = 0.5$, $\alpha_{landmark} = 1$) in output network (O-Net) for more accurate facial landmark localization. $\beta_i^j \in \{0, 1\}$ is the sample type indicator. In this case, it is natural to employ stochastic gradient descent to train these CNNs.

5) Online Hard Sample Mining: Different from conducting traditional hard sample mining after the original classifier has been trained, we conduct online hard sample mining in the face/nonface classification task, which is adaptive to the training process. In particular, in each minibatch, we sort the losses computed in the forward propagation from all samples and select the top 70% of them as hard samples. Then, we only compute the gradients from these hard samples in the backward propagation. That means we ignore the easy samples that are less helpful to strengthen the detector during training. Experiments show that this strategy yields better performance without manual sample selection. Its effectiveness is demonstrated in Section III.

Fig. 2. Architectures of P-Net, R-Net, and O-Net, where MP means max pooling and Conv means convolution. The step size in convolution and pooling is 1 and 2, respectively.

Fig. 3. (a) Detection performance of P-Net with and without online hard sample mining. (b) JA denotes joint face alignment learning in O-Net, while No JA denotes training without it. No JA in BBR denotes using the No JA O-Net for bounding box regression.

III. EXPERIMENTS

In this section, we first evaluate the effectiveness of the proposed hard sample mining strategy. Then, we compare our face detector and alignment against the state-of-the-art methods in face detection dataset and benchmark (FDDB) [25], WIDER FACE [24], and annotated facial landmarks in the wild (AFLW) benchmark [8]. FDDB dataset contains the annotations for 5171 faces in a set of 2845 images. WIDER FACE dataset consists of labeled face bounding boxes in images, where 50% of them are for testing (divided into three subsets according to the difficulty of images), 40% for training, and the remaining for validation. AFLW contains the facial landmark annotations for faces and we use the same test subset as [22]. Finally, we evaluate the computational efficiency of our face detector.

A. Training Data

Since we jointly perform face detection and alignment, here we use the following four different kinds of data annotation in our training process:
1) negatives: regions whose intersection-over-union (IoU) ratio is less than 0.3 to any ground-truth faces;
2) positives: IoU above 0.65 to a ground-truth face;
3) part faces: IoU between 0.4 and 0.65 to a ground-truth face; and
4) landmark faces: faces labeled with five landmark positions.
There is an unclear gap between part faces and negatives, and there are variances among different face annotations. So, we choose an IoU gap between 0.3 and 0.4. Negatives and positives are used for face classification tasks, positives and part faces are

used for bounding box regression, and landmark faces are used for facial landmark localization. Total training data are composed of 3:1:1:2 (negatives/positives/part face/landmark face) data. The training data collection for each network is described as follows:
1) P-Net: We randomly crop several patches from WIDER FACE [24] to collect positives, negatives, and part face. Then, we crop faces from CelebA [23] as landmark faces.
2) R-Net: We use the first stage of our framework to detect faces from WIDER FACE [24] to collect positives, negatives, and part face, while landmark faces are detected from CelebA [23].
3) O-Net: Similar to R-Net to collect data, but we use the first two stages of our framework to detect faces and collect data.

B. Effectiveness of Online Hard Sample Mining

To evaluate the contribution of the proposed online hard sample mining strategy, we train two P-Nets (with and without online hard sample mining) and compare their performance on FDDB. Fig. 3(a) shows the results from two different P-Nets on FDDB. It is clear that the online hard sample mining is beneficial to improve performance. It can bring about 1.5% overall performance improvement on FDDB.

C. Effectiveness of Joint Detection and Alignment

To evaluate the contribution of joint detection and alignment, we evaluate the performances of two different O-Nets (with and without joint facial landmark regression learning) on FDDB (with the same P-Net and R-Net). We also compare the performance of bounding box regression in these two O-Nets. Fig. 3(b) suggests that joint landmark localization task learning helps to enhance both face classification and bounding box regression tasks.

D. Evaluation on Face Detection

To evaluate the performance of our face detection method, we compare our method against the state-of-the-art methods [1], [5], [6], [11], [18], [19], [26]-[29] in FDDB, and the state-of-the-art methods [1], [11], [24] in WIDER FACE. Fig. 4(a)-(d) shows that our method consistently outperforms all the compared approaches by a large margin in both the benchmarks. We also evaluate our approach on some challenging photos.1

Fig. 4. (a) Evaluation on FDDB. (b)-(d) Evaluation on three subsets of WIDER FACE. The number following the method indicates the average accuracy.

E. Evaluation on Face Alignment

In this part, we compare the face alignment performance of our method against the following methods: RCPR [12], TSPM [7], Luxand face SDK [17], ESR [13], CDM [15], SDM [21], and TCDCN [22]. The mean error is measured by the distances between the estimated landmarks and the ground truths, and normalized with respect to the interocular distance. Fig. 5 shows that our method outperforms all the state-of-the-art methods with a margin. It also shows that our method shows less superiority in mouth corner localization. It may result from the small variances of expression, which has a significant influence in mouth corner position, in our training data.

Fig. 5. Evaluation on AFLW for face alignment.

F. Runtime Efficiency

Given the cascade structure, our method can achieve high speed in joint face detection and alignment. We compare our method with the state-of-the-art techniques on GPU and the results are shown in Table II. It is noted that our current implementation is based on unoptimized MATLAB codes.

TABLE II
SPEED COMPARISON OF OUR METHOD AND OTHER METHODS

Method             GPU                  Speed
Ours               Nvidia Titan Black   99 FPS
Cascade CNN [19]   Nvidia Titan Black   100 FPS
Faceness [11]      Nvidia Titan Black   20 FPS
DP2MFD [27]        Nvidia Tesla K       FPS

IV. CONCLUSION

In this letter, we have proposed a multitask cascaded CNNs-based framework for joint face detection and alignment. Experimental results demonstrated that our methods consistently outperform the state-of-the-art methods across several challenging benchmarks (including FDDB and WIDER FACE benchmarks for face detection, and AFLW benchmark for face alignment) while achieving real-time performance for VGA images with minimum face size. The three main contributions for performance improvement are carefully designed cascaded CNNs architecture, online hard sample mining strategy, and joint face alignment learning.

1 Examples are shown in
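The sample type indicator of Eq. (4) and the online hard sample mining step of Section II-C can be sketched together as follows. This is a minimal pure-Python sketch under stated assumptions: the function names `hard_sample_mask` and `multitask_loss`, the dictionary layout, the rounding of the 70% cut-off, and the choice to rank only samples that carry a detection label are all hypothetical. Zeroing the loss of easy samples stands in here for skipping their gradients in the backward pass; the letter does not spell out how this is implemented.

```python
def hard_sample_mask(det_losses, beta_det, keep_percent=70):
    """Flag the hardest detection samples in one minibatch.

    Only samples with beta_det == 1 are ranked (an assumption of this
    sketch). The top keep_percent of them, by loss, get flag 1.
    """
    eligible = [i for i, b in enumerate(beta_det) if b == 1]
    # Integer arithmetic avoids floating-point surprises; floor rounding
    # is an assumption, the source only says "top 70%".
    n_keep = max(1, (len(eligible) * keep_percent) // 100)
    ranked = sorted(eligible, key=lambda i: det_losses[i], reverse=True)
    mask = [0] * len(det_losses)
    for i in ranked[:n_keep]:
        mask[i] = 1
    return mask


def multitask_loss(losses, beta, alpha, hard_mask=None):
    """Weighted sum of Eq. (4) over samples i and tasks j.

    losses[j][i] is L_i^j, beta[j][i] is the 0/1 sample type indicator,
    alpha[j] is the task weight. hard_mask, if given, gates only "det".
    """
    total = 0.0
    for task, weight in alpha.items():
        for i, loss in enumerate(losses[task]):
            indicator = beta[task][i]
            if task == "det" and hard_mask is not None:
                indicator *= hard_mask[i]
            total += weight * indicator * loss
    return total


# Hypothetical minibatch: negative, positive, negative, part face, landmark face.
losses = {
    "det": [0.05, 2.3, 0.7, 0.0, 0.0],
    "box": [0.0, 0.2, 0.0, 0.4, 0.0],
    "landmark": [0.0, 0.0, 0.0, 0.0, 0.1],
}
beta = {
    "det": [1, 1, 1, 0, 0],
    "box": [0, 1, 0, 1, 0],
    "landmark": [0, 0, 0, 0, 1],
}
alpha_pnet = {"det": 1.0, "box": 0.5, "landmark": 0.5}  # P-Net/R-Net weights from the text

mask = hard_sample_mask(losses["det"], beta["det"])
total = multitask_loss(losses, beta, alpha_pnet, hard_mask=mask)
```

With three detection-labeled samples, the floor of 70% keeps the two largest losses, so the easy negative with loss 0.05 is dropped from the detection term while the box and landmark terms are unaffected. Swapping `alpha_pnet` for weights with a landmark value of 1 would give the O-Net setting described in the text.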

REFERENCES

[1] B. Yang, J. Yan, Z. Lei, and S. Z. Li, "Aggregate channel features for multi-view face detection," in IEEE Int. Joint Conf. Biometrics, 2014.
[2] P. Viola and M. J. Jones, "Robust real-time face detection," Int. J. Comput. Vis., vol. 57, no. 2.
[3] M. T. Pham, Y. Gao, V. D. D. Hoang, and T. J. Cham, "Fast polygonal integration and its application in extending Haar-like features to improve object detection," in IEEE Conf. Comput. Vis. Pattern Recognit., 2010.
[4] Q. Zhu, M. C. Yeh, K. T. Cheng, and S. Avidan, "Fast human detection using a cascade of histograms of oriented gradients," in IEEE Comput. Conf. Comput. Vis. Pattern Recognit., 2006.
[5] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, "Face detection without bells and whistles," in Eur. Conf. Comput. Vis., 2014.
[6] J. Yan, Z. Lei, L. Wen, and S. Li, "The fastest deformable part model for object detection," in IEEE Conf. Comput. Vis. Pattern Recognit., 2014.
[7] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in IEEE Conf. Comput. Vis. Pattern Recognit., 2012.
[8] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof, "Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization," in IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2011.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Adv. Neural Inf. Process. Syst., 2012.
[10] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in Adv. Neural Inf. Process. Syst., 2014.
[11] S. Yang, P. Luo, C. C. Loy, and X. Tang, "From facial parts responses to face detection: A deep learning approach," in IEEE Int. Conf. Comput. Vis., 2015.
[12] X. P. Burgos-Artizzu, P. Perona, and P. Dollar, "Robust face landmark estimation under occlusion," in IEEE Int. Conf. Comput. Vis., 2013.
[13] X. Cao, Y. Wei, F. Wen, and J. Sun, "Face alignment by explicit shape regression," Int. J. Comput. Vis., vol. 107, no. 2.
[14] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, Jun.
[15] X. Yu, J. Huang, S. Zhang, W. Yan, and D. Metaxas, "Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model," in IEEE Int. Conf. Comput. Vis., 2013.
[16] J. Zhang, S. Shan, M. Kan, and X. Chen, "Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment," in Eur. Conf. Comput. Vis., 2014.
[17] Luxand Incorporated: Luxand face SDK. [Online]. Available:
[18] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun, "Joint cascade face detection and alignment," in Eur. Conf. Comput. Vis., 2014.
[19] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[20] C. Zhang and Z. Zhang, "Improving multiview face detection with multitask deep convolutional neural networks," in IEEE Winter Conf. Appl. Comput. Vis., 2014.
[21] X. Xiong and F. Torre, "Supervised descent method and its applications to face alignment," in IEEE Conf. Comput. Vis. Pattern Recognit., 2013.
[22] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, "Facial landmark detection by deep multi-task learning," in Eur. Conf. Comput. Vis., 2014.
[23] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in IEEE Int. Conf. Comput. Vis., 2015.
[24] S. Yang, P. Luo, C. C. Loy, and X. Tang, "WIDER FACE: A face detection benchmark," arXiv:
[25] V. Jain and E. G. Learned-Miller, "FDDB: A benchmark for face detection in unconstrained settings," Univ. Massachusetts, Amherst, MA, USA, Tech. Rep. UMCS.
[26] B. Yang, J. Yan, Z. Lei, and S. Z. Li, "Convolutional channel features," in IEEE Int. Conf. Comput. Vis., 2015.
[27] R. Ranjan, V. M. Patel, and R. Chellappa, "A deep pyramid deformable part model for face detection," in IEEE Int. Conf. Biometrics Theory, Appl. Syst., 2015.
[28] G. Ghiasi and C. C. Fowlkes, "Occlusion coherence: Detecting and localizing occluded faces," arXiv:
[29] S. S. Farfade, M. J. Saberian, and L. J. Li, "Multi-view face detection using deep convolutional neural networks," in ACM Int. Conf. Multimedia Retrieval, 2015.
[30] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in IEEE Int. Conf. Comput. Vis., 2015.
