Face Detection with Deep Learning

Yu Shen, Yus122@ucsd.edu, A13227146
Kuan-Wei Chen, kuc010@ucsd.edu, A99045121
Yizhou Hao, y3hao@ucsd.edu, A98017773
Min Hsuan Wu, mhwu@ucsd.edu, A92424998

Abstract

This project presents an implementation of face detection using deep learning. The main idea is the Multi-task Cascaded Convolutional Neural Network (MTCNN), which chains three sub-networks that together learn to recognize human faces through several stages of decomposition and filtering. The dataset used is FDDB, which contains over five thousand faces in a set of around two thousand eight hundred images.

1. Introduction

With the development of machine learning and computer vision, face detection has become a very popular and useful technology. The history of face detection can be divided into three phases. The first period ran from 1964 to 1990. It was the start of face detection, and research in this period was mainly based on geometric features. At this early stage, there were not many applications related to face detection. The second period ran from 1991 to 1997. Although this period was short (only seven years), it was a peak time for face detection research and development. Many new algorithms and business applications of face detection appeared during this time, and research moved from geometric-feature-based methods to Eigenface-based methods. In the third period, from 1998 to now, researchers have focused on face detection under non-ideal conditions. For example, if the light is too strong or people are moving too fast, it is hard to track faces. Hence, people have tried to implement face detection based on three-dimensional models to track faces better.

Today, face detection is widely used in different industries and products. It can be used in the access control system of a door or a building. It can also be used by companies or schools to check employee or student attendance.
The new iPhone released several months ago also uses this technology to unlock the screen. Face detection is likely to become an indispensable part of the intelligent society in the near future, and it may be built into even more applications, such as digital passports. As students studying machine learning, we are curious and enthusiastic about this new and popular technology. Hence, in this project, we use deep learning to detect human faces in images.

2. Related Work

As mentioned in the introduction, face detection has been studied for over 50 years. Hence, there are many interesting face detection methods, each with its own advantages and disadvantages.

For example, the Viola-Jones method [6] is the first framework that allows face detection in real time. It is a three-step framework consisting of feature computation, classification, and combination of classifiers. Overall, Viola-Jones is a successful method with fast detection speed, good accuracy, and a low false positive rate. However, it takes a long time to train and is restricted in the head poses it can handle.

Local Binary Patterns (LBP) [1] is another effective method. In this method, every pixel is assigned a texture value, which can be combined with the target for tracking. The advantages of LBP include fast computation, a successful description of texture features, and simple steps. However, it can only be used on gray images, and its accuracy is not good.

AdaBoost, as reviewed in [3] (2003), is a learning algorithm that creates a strong classifier by choosing visual features from a set of simple classifiers and combining them linearly. It is simple to implement because it does not require any prior knowledge about face structure. It can also be used with many different classifiers and improves classification accuracy.
The disadvantages of AdaBoost are, first, slow training and, second, sensitivity to noisy data, which can lead to low final detection accuracy.

The SMQT features and SNoW classifier method [5] is a relatively new method published in 2011. It has two phases. The first phase is face luminance, which gets the pixel information of the image. The second phase is detection, which uses SMQT features for feature extraction and SNoW to speed up the classifier. This method is computationally efficient, but it is prone to false positives.

The last method discussed here is the neural-network-based method [4], which is the foundation of MTCNN [7]. This early neural network method has two stages: filtering, and merging and arbitrating. Its advantages are an acceptable false detection rate and acceptable accuracy. Its disadvantages are a slow detection process and a complex methodology.

The method we implemented in our project is MTCNN. Compared to all the methods above, MTCNN has the highest accuracy. Although training takes some time, we can save time by using a pre-trained model while keeping relatively high accuracy. MTCNN also supports real-time face detection. Hence, MTCNN is a very good method for face detection.

3. Dataset and Features

The data used is the FDDB dataset [2]. It contains 5171 faces in a set of 2845 images (both gray-scale and color). The dataset is divided into 10 folds with roughly 520 labeled faces per fold. Each image may contain multiple faces. The dataset provides labels for the positions of human faces within each picture. The labels were marked by drawing an ellipse around each human face and recording the radii of both axes, the center of the ellipse, and the angle of the ellipse.

Figure 1. Sample data

As seen in Figure 1, an image can contain multiple faces under different lighting. Notice that the two faces at the back have poor lighting, and their bodies, and even the edges of their faces, are blocked by the people in front.

Figure 2. Sample data with elliptical label

In Figure 2, the labels are drawn as ellipses. Notice that ellipses may overlap with each other. In addition, note that the face of the person on the left is not labeled. This is because only faces with at least one eye visible are labeled. The annotators also only label faces that are at least 20 pixels in both height and width. Since these faces are annotated by different annotators, there could be slight differences in judgment on whether a face should be labeled. Moreover, different face poses, occlusions, and resolution all affect the annotations. Therefore, there are intrinsic difficulties in making perfect labels. However, these should not affect most annotations significantly, and the annotations should still be relatively accurate.

4. Method

The method we used is called Multi-task Cascaded Convolutional Networks (MTCNN) [7]. Unlike other algorithms, MTCNN cascades three CNNs with different structures for face detection. Figure 3 shows the pipeline of MTCNN, and Figure 4 shows the detailed structure of its networks. Before feeding a test image into the classifier, we resize the image to different scales and stack the results into an image pyramid. This produces the same face at different scales, which improves the network's ability to detect faces of different sizes. After that, a sliding window is applied to the pyramid, breaking the image into regions that are the inputs to the network.

4.1. Stage-1

Stage-1 is called P-Net, which stands for Proposal Network. It is a shallow network that roughly decides which regions contain faces. For each proposed bounding box, three tasks are applied: face classification, bounding box regression, and facial landmark localization. We introduce these three tasks in Section 4.4. The outputs of P-Net are proposed bounding boxes and their face classification scores. Non-maximum suppression is then applied to remove bounding boxes that overlap heavily with each other. The remaining boxes are resized to 24x24x3 and used as the input to the next network.
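
The image pyramid and non-maximum suppression steps described in this section can be sketched as follows. This is a minimal pure-Python illustration, not the project's implementation. The 12-pixel window size, the 20-pixel minimum face size, the 0.709 shrink factor, the 0.5 IoU threshold, and the names `pyramid_scales`, `iou`, and `non_max_suppression` are all assumptions chosen for the example.

```python
from typing import List, Tuple

# A scored box: (x1, y1, x2, y2, score). This layout is an assumption.
Box = Tuple[float, float, float, float, float]


def pyramid_scales(height: int, width: int, min_face: int = 20,
                   net_input: int = 12, factor: float = 0.709) -> List[float]:
    """Return the resize factors that make up the image pyramid.

    The first scale maps a face of `min_face` pixels onto the assumed
    `net_input`-pixel window. Each further level shrinks the image by
    `factor` until the shorter side no longer fits one window.
    """
    scale = net_input / min_face
    min_side = min(height, width) * scale
    scales = []
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: List[Box],
                        iou_threshold: float = 0.5) -> List[Box]:
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it."""
    remaining = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining if iou(best, b) <= iou_threshold]
    return kept
```

For instance, two heavily overlapping proposals collapse to the higher-scoring one, while a distant third proposal is kept unchanged.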

4.2. Stage-2

Stage-2 is called R-Net, which stands for Refine Network. This network has convolutions with larger kernels and a fully connected layer, which makes it more powerful than the previous network. Its goal is to refine the results from the previous network. Bounding boxes with low face classification scores are discarded. Again, for the regions with high classification scores, bounding box regression and facial landmark localization are applied. The outputs of this network are still bounding boxes and their classification scores. Those boxes are resized to 48x48x3 and used as the input to the next network.

4.3. Stage-3

Stage-3 is called O-Net, which stands for Output Network. This network is deeper and uses larger convolution kernels than the previous networks. It makes the final decision about where the face is and what the size of the bounding box should be.

Figure 3. Pipeline of three stages of MTCNN

Figure 4. Three nets of MTCNN

4.4. Classification and Regression

At the end of each of the three stages, face classification, bounding box regression, and facial landmark localization are computed. Their loss functions differ, so we introduce them one by one.

4.4.1 Face classification

For face classification, cross entropy is used as the loss function:

$$ L_i^{det} = -\left( y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i) \right) \tag{1} $$

where $p_i$ is the probability produced by the network and $y_i^{det} \in \{0, 1\}$ is the ground-truth label of the face.

4.4.2 Bounding Box Regression

For bounding box regression, the Euclidean distance (L2 loss) is used as the loss function:

$$ L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2 \tag{2} $$

where $\hat{y}_i^{box}$ is the regression target generated by the network and $y_i^{box}$ is the ground-truth location.

4.4.3 Facial Landmark Localization

L2 loss is also used as the loss function for this part:

$$ L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2 \tag{3} $$

where $\hat{y}_i^{landmark}$ is the location of the facial landmarks generated by the network and $y_i^{landmark}$ is the ground-truth location.

The weighted sum of these three loss values is used as the loss for backpropagation. The weights are fixed values.
In P-Net and R-Net, the weights for face classification, bounding box regression, and landmark localization are 1.0, 0.5, and 0.5, respectively. In the last stage (O-Net), the weights are 1.0, 0.5, and 1.0.
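
The weighted combination of the three losses can be sketched for a single sample as follows. This is a minimal illustration under stated assumptions, not the training code used in the project. It assumes the standard binary cross-entropy for classification and the squared L2 norm for both regressions. The names `face_classification_loss`, `squared_l2_loss`, `STAGE_WEIGHTS`, and `total_loss` are hypothetical.

```python
import math
from typing import Sequence

# Per-stage weights (classification, box regression, landmark localization),
# taken from the values stated in the text.
STAGE_WEIGHTS = {
    "P-Net": (1.0, 0.5, 0.5),
    "R-Net": (1.0, 0.5, 0.5),
    "O-Net": (1.0, 0.5, 1.0),
}


def face_classification_loss(p: float, y_det: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one sample; p is the predicted face probability."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y_det * math.log(p) + (1 - y_det) * math.log(1.0 - p))


def squared_l2_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Squared Euclidean distance between a prediction and its ground truth."""
    return sum((a - b) ** 2 for a, b in zip(pred, target))


def total_loss(stage: str, p: float, y_det: int,
               box_pred: Sequence[float], box_true: Sequence[float],
               lm_pred: Sequence[float], lm_true: Sequence[float]) -> float:
    """Weighted sum of the three task losses for the given stage."""
    w_det, w_box, w_lm = STAGE_WEIGHTS[stage]
    return (w_det * face_classification_loss(p, y_det)
            + w_box * squared_l2_loss(box_pred, box_true)
            + w_lm * squared_l2_loss(lm_pred, lm_true))
```

The only difference between stages in this sketch is the landmark weight, which rises from 0.5 to 1.0 in O-Net, so the same landmark error contributes twice as much to the last stage's loss.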

5. Experiments/Results/Discussion

5.1. Evaluation method

The algorithm was tested on the FDDB dataset. The evaluation method compares the center coordinate of each ellipse label with the center of each output rectangle. If the two centers are within 50 pixels of each other both horizontally and vertically, the rectangle is considered a correct detection. If there is a detection rectangle but no elliptical label near it, it is considered a false positive. An example of a false positive is shown in Figure 5. Notice that the face of the person on the left was detected with a green bounding rectangle, but it was not labeled, since neither of the person's eyes was visible. On the other hand, if there is a labeled ellipse but the detector fails to find the face, it is considered a false negative. An example is shown in Figure 6. Notice that a red ellipse in the top center labels a man's face that is partially covered by the person in front. Since most of the face was covered, the detector was unable to detect it. However, it was still labeled, since one of the person's eyes is visible.

Figure 6. Example of false negative

In [2], the authors proposed an evaluation method that calculates the overlap between the bounding rectangle and the ellipse. Although it gives a clearer meaning to what a correct detection is, i.e. over 80% overlap between the rectangle and the ellipse, it would count every detection as false whenever the rectangle and the ellipse have noticeably different sizes. Since the distance between center pixels is simpler to implement, it was used to evaluate the results. Note that the overlap percentage is as arbitrary a threshold as the 50 pixels used here, so either evaluation method is intrinsically imperfect, as are the labels. Nonetheless, the evaluation result should still be a meaningful metric for testing the algorithm.

Fold   False positives   Correct detections   Total faces
1      23                480                  515
2      18                485                  519
3      25                483                  517
4      18                488                  517
5      17                486                  514
6      21                484                  518
7      30                494                  518
8      24                468                  518
9      15                483                  514
10     24                495                  521
All    215               4846                 5171

Table 1. Detection results
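
The center-distance matching rule from Section 5.1 can be sketched as follows. This is a minimal illustration, not the evaluation script used for the reported numbers. The tuple layouts, the inclusive 50-pixel comparison, and the one-to-one greedy matching (each label can be claimed by at most one detection) are assumptions, and `evaluate_image` and `rates` are hypothetical names. `rates` computes the two ratios correct / total faces and correct / (correct + false positives).

```python
from typing import List, Tuple

# Assumed layouts: ellipse label (major_radius, minor_radius, angle, cx, cy)
# and detection rectangle (x, y, width, height).
Ellipse = Tuple[float, float, float, float, float]
Rect = Tuple[float, float, float, float]


def rect_center(rect: Rect) -> Tuple[float, float]:
    """Center point of a rectangle given as (x, y, width, height)."""
    return rect[0] + rect[2] / 2.0, rect[1] + rect[3] / 2.0


def evaluate_image(labels: List[Ellipse], detections: List[Rect],
                   tol: float = 50.0) -> Tuple[int, int, int]:
    """Return (correct, false_positives, false_negatives) for one image."""
    unmatched = list(labels)
    correct = 0
    false_positives = 0
    for det in detections:
        cx, cy = rect_center(det)
        # First still-unmatched label whose center is within tol on both axes.
        match = next((e for e in unmatched
                      if abs(e[3] - cx) <= tol and abs(e[4] - cy) <= tol), None)
        if match is None:
            false_positives += 1
        else:
            unmatched.remove(match)
            correct += 1
    # Labels that no detection claimed are false negatives.
    return correct, false_positives, len(unmatched)


def rates(correct: int, false_positives: int,
          total_faces: int) -> Tuple[float, float]:
    """Return (correct / total faces, correct / all detections)."""
    return correct / total_faces, correct / (correct + false_positives)
```

With the totals in the "All" row of Table 1 (4846 correct detections, 215 false positives, 5171 faces), `rates` gives about 0.937 and 0.958.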
Figure 5. Example of false positive

5.2. Results

The results are recorded in Table 1 and Table 2. Since there are two kinds of errors, false positives and false negatives, two metrics are used to evaluate the detector. The accuracy is the number of correct detections divided by the total number of labeled faces (a ratio commonly called recall). It measures how many of the labeled faces the detector can find. The true positive rate is the number of correct detections divided by the number of all detections made, that is, correct detections plus false positives (a ratio commonly called precision). It measures what percentage of the detector's output is a true detection.

Fold   Accuracy   True positive rate
1      0.932      0.954
2      0.934      0.964
3      0.934      0.951
4      0.944      0.964
5      0.946      0.966
6      0.934      0.958
7      0.954      0.943
8      0.903      0.951
9      0.940      0.970
10     0.950      0.954
All    0.937      0.958

Table 2. Accuracy and true positive rates

6. Conclusion/Future Work

Using the FDDB dataset for evaluation, our implementation of MTCNN achieved an average accuracy of 93.7% and a true positive rate of 95.8%. The roughly 4% of detections that are false positives indicate that there are some differences between MTCNN's detections and FDDB's labels. We looked into the misprediction cases and found that some faces with neither eye visible are still detected by MTCNN. Since FDDB only labels faces with at least one eye visible, these faces are not labeled, which raised our false positive rate. Also, some labeled faces are in extremely low light or are covered by other objects. Faces of this kind are difficult for MTCNN to detect.

For future work, we can improve our model by training on more edge cases like the ones mentioned above. Doing more evaluations like the one on FDDB can also help us find the weaknesses of our model. In addition, training on an even larger dataset should allow the network to learn more extreme cases and therefore improve its performance. We can also make our MTCNN implementation work for real-time face detection. Given that the accuracy is high enough, we can work on other functionalities, such as face recognition or emotion recognition, in the future.

References

[1] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. In European Conference on Computer Vision, pages 469-481. Springer, 2004.
[2] V. Jain and E. Learned-Miller. FDDB: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[3] R. Meir and G. Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning, pages 118-183. Springer, 2003.
[4] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.
[5] K. Somashekar, C. Puttamadappa, and D. Chandrappa. Face detection by SMQT features and SNoW classifier using color information. International Journal of Engineering Science and Technology, 3(2), 2011.
[6] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I-I. IEEE, 2001.
[7] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499-1503, Oct 2016.