PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL

Yingxin Lou 1, Guangtao Fu 2, Zhuqing Jiang 1, Aidong Men 1, and Yun Zhou 2

1 Beijing University of Posts and Telecommunications, Beijing, P.R. China
{louyingxin; jiangzhuqing; menad}@bupt.edu.cn
2 Academy of Broadcasting Science, Beijing, P.R. China
{fuguangtao; zhouyun}@abs.ac.cn

This work was funded by the National Science Foundation of China (No. 61671077, No. 61671264).

ABSTRACT

Pt-Net is a novel object detection network based on a pre-trained, multi-feature VGG-16 network. First, Pt-Net is initialized by a linear combination of a pre-trained VGG-16 model and its own CNN output. Second, Pt-Net generates proposals via a particle filter method on the Conv5 feature map and crops the multi-feature maps, which are built by fusing hierarchical CNN features at corresponding positions. We then concatenate the cropped multi-feature regions to gather richer image information and adopt a novel two-dimensional overlap-area loss function for localization. Finally, we apply Pt-Net to both object detection and face detection, training on the PASCAL VOC and WIDER FACE datasets respectively. Pt-Net achieves 76.8% mAP on the PASCAL VOC 2007 detection benchmark and state-of-the-art results on the FDDB benchmark at 43 fps on an NVIDIA GTX 1070 GPU.

Index Terms: Convolutional Neural Networks, Pre-trained model, Particle filter, Multi-feature, Overlap loss

1. INTRODUCTION

Deep convolutional neural networks (CNNs) [1] are used in many domains, especially computer vision, where they have driven impressive improvements in tasks such as object detection, which combines determining object categories (object classification) with finding the locations of objects in an image (object localization). Since trained CNNs succeeded at the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [1], CNN-based object detection has made it possible to replace traditional image features such as SIFT [2] and HOG [3] with high-level object representations obtained from the output of a CNN model. CNN-based detectors have achieved state-of-the-art results in many applications, including object detection [4, 5, 6, 7, 8], face detection [9, 10, 11, 12], and others [13, 14].

State-of-the-art object detection methods, such as R-CNN [4] proposed by Girshick et al., typically adopt region proposal methods: a pre-processing step that provides a set of candidate bounding boxes roughly localizing the objects in an image, which are then refined for precise localization. Selective Search (SS) [15], which greedily merges superpixels based on engineered low-level features, is one of the most popular proposal methods. The proposals in each image are warped to a fixed size for CNN input and transformed into 4096-dimensional feature vectors. However, R-CNN carries a heavy computational burden because it extracts features separately for each proposal. Hence, SPP-net [16] and Fast R-CNN [5] were proposed to extract image features only once through the CNN. Fast R-CNN [5] with a region proposal method has been an impressive detector on the PASCAL VOC [17] and ImageNet [1] datasets. However, region proposal generation takes nearly 2 s per image, which can be a major computational bottleneck in the detection pipeline.

To reduce this cost, Faster R-CNN [6] proposes Region Proposal Networks (RPNs), which share convolutional (Conv) layer parameters, use 3 scales and 3 aspect ratios to form 9 anchors in each grid cell, and obtain high-quality region proposals via the VGG-16 [18] network. However, Faster R-CNN [6] still has a few drawbacks: (1) The classical VGG-16 [18] architecture in Faster R-CNN [6] is initialized only with the ILSVRC [1] parameter weights, although the output of the CNN itself is also meaningful and can be combined with them for more precise detection. (2) It adopts only the Conv5 feature map, which is not accurate enough for object detection, so bounding boxes cannot cover objects tightly. Multiple layers can be combined instead, since lower layers have naturally high-resolution features for localization and higher layers carry more semantic information for classification. (3) Faster R-CNN [6] generates proposals from 9 anchors, which is too coarse to enclose objects of varied shapes and sizes, so we propose a fast method of generating proposals via a particle filter. (4) Objects cannot be enclosed precisely by traditional smooth L1 bounding-box regression, which adapts poorly, so we add a coordinate dimension through a two-dimensional overlap-area loss function.

Fig. 1: Pt-Net architecture. Our model (1) is first initialized by a linear combination of a pre-trained VGG-16 model and its own CNN output, (2) feeds an image into the optimally pre-trained CNN, (3) aggregates the outputs of selected layers into multi-feature maps, (4) generates proposals on the Conv5 feature map via the particle filter method, (5) maps the proposals onto the multi-feature maps and crops them, (6) concatenates the cropped regions, and (7) classifies and localizes via a novel overlap loss function for the object detection and face detection tasks.

To summarize, Pt-Net achieves higher accuracy with lower runtime on both object detection and face detection. Our contributions are:

Pre-trained VGG-16 network. The classical VGG-16 architecture is usually initialized directly from the ILSVRC weights, which ignores the network's own output features. We therefore propose a linear combination of the ImageNet caffemodel and the CNN output via a proportion parameter.

Multi-feature architecture. Higher layers represent more semantic information for better classification, while lower layers retain more original image information for more precise localization. Hence, we build multi-feature maps by concatenating features from different layers.

Particle filter for proposals. Traditional proposal generation with SS or an RPN is either time-consuming or imprecise. We therefore sample proposals via a particle filter guided by object features, which yields faster and more accurate detection.

A novel overlap loss function. We propose an overlap loss that regresses the bounding box as a two-dimensional overlap area instead of four independent one-dimensional coordinates, so the optimization treats the box as a unified whole.

2. METHODS

2.1. Pre-trained model

Training the CNN parameters of a detection architecture from scratch takes much time, so networks are usually initialized from the pre-trained VGG-16 caffemodel on ImageNet and then fine-tuned. However, traditional methods use the pre-trained model directly and ignore the network's own output, which also carries useful information. We therefore combine the ImageNet caffemodel and the CNN output via a proportion parameter as follows:

F = C \cdot F_1 + (1 - C) \cdot F_2    (1)

where F_1 is the CNN output, F_2 is the pre-trained caffemodel, and the proportion parameter C \sim N(\mu, \sigma^2) is a stochastic variable obeying a Gaussian distribution. In actual scenes, we choose another expression for C:

C = \mu + \sigma e    (2)

where e \sim N(0, 1) is a standard Gaussian variable. As a result, we can compute the gradients of the 13 convolutional layers of VGG-16 for SGD (stochastic gradient descent) [19] during back-propagation through the addition.
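
Equations (1)-(2) amount to a stochastic convex blend of two weight tensors. The following minimal PyTorch sketch is our reading of that blend; the paper does not report sigma, so 0.1 is a placeholder, and the tensor shapes are illustrative only.

```python
import torch

def blend_weights(f1, f2, mu=0.5, sigma=0.1):
    """Eq. (1)-(2): F = C*F1 + (1 - C)*F2 with C = mu + sigma*e, e ~ N(0, 1).
    f1 is the layer's own trained weights; f2 is the ImageNet pre-trained
    weights. sigma is not reported in the paper; 0.1 is a placeholder."""
    e = torch.randn(())               # e ~ N(0, 1), Eq. (2)
    c = mu + sigma * e                # C ~ N(mu, sigma^2)
    return c * f1 + (1 - c) * f2      # Eq. (1)

# Blend one VGG-16-style conv kernel; mu = 0.5 is the best value in Table 2.
f1 = torch.randn(64, 3, 3, 3)         # network's own output weights (F1)
f2 = torch.randn(64, 3, 3, 3)         # pre-trained caffemodel weights (F2)
blended = blend_weights(f1, f2)
```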

2.2. Multi-feature

Most state-of-the-art detectors such as Faster R-CNN [6] use only the last convolutional layer's output as the feature map. However, a single-layer feature map cannot represent classification and localization information equally well. Lower layers carry more original image information for better localization, while higher layers carry more semantic information for classification [20, 21, 22]. To combine the advantages of both, we fuse the outputs of the Conv1, Conv2, Conv3, and Conv5 layers and connect them into multi-feature maps. After the proposals are mapped onto the multi-feature maps, a 1×1 convolution is first applied to preserve the receptive field of the previous layer and to reduce computation before the 3×3 and 5×5 convolutions. Next, a max pooling layer over the multi-feature maps and a 1×1 convolution with concatenated rectified linear units (C.ReLU) [23], which halve the computation, are used to capture object spatial localization and full-context features [21]. Finally, all parts are concatenated for more precise detection.
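
As one concrete reading of this fusion, here is a loose PyTorch sketch. The channel widths, the intermediate width, and the bilinear resize that aligns the four maps to Conv5's resolution are all our assumptions; the paper does not spell these details out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CReLU(nn.Module):
    """Concatenated ReLU [23]: concat(relu(x), relu(-x)) halves the number
    of conv filters needed for a given output width."""
    def forward(self, x):
        return torch.cat([F.relu(x), F.relu(-x)], dim=1)

class MultiFeatureFusion(nn.Module):
    """Rough sketch of Sec. 2.2: fuse Conv1/Conv2/Conv3/Conv5 VGG-16 maps.
    Widths and the resize-to-Conv5 alignment are assumptions, not the
    paper's exact architecture."""
    def __init__(self, in_chs=(64, 128, 256, 512), mid=128):
        super().__init__()
        # 1x1 convs reduce channels (and computation) before 3x3/5x5 convs.
        self.reduce = nn.ModuleList(nn.Conv2d(c, mid, 1) for c in in_chs)
        self.conv3 = nn.Conv2d(mid * len(in_chs), mid, 3, padding=1)
        self.conv5 = nn.Conv2d(mid * len(in_chs), mid, 5, padding=2)
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)
        self.crelu = CReLU()

    def forward(self, feats):
        # feats: [Conv1, Conv2, Conv3, Conv5] maps; align to Conv5's size.
        size = feats[-1].shape[-2:]
        x = torch.cat([F.interpolate(r(f), size=size, mode='bilinear',
                                     align_corners=False)
                       for r, f in zip(self.reduce, feats)], dim=1)
        # Concatenate the multi-scale context branches for the heads.
        return torch.cat([self.crelu(self.conv3(x)),
                          self.crelu(self.conv5(x)),
                          self.pool(x)], dim=1)

# Toy usage with VGG-16-like map sizes for a 224x224 input:
feats = [torch.randn(1, 64, 224, 224), torch.randn(1, 128, 112, 112),
         torch.randn(1, 256, 56, 56), torch.randn(1, 512, 14, 14)]
out = MultiFeatureFusion()(feats)    # -> shape (1, 1024, 14, 14)
```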

2.3. Proposals generation

Fast R-CNN [5] adopts SS, a time-consuming step, to get object proposals. Faster R-CNN [6] generates proposals with an RPN, which slides a 3×3 window over the feature map and then scores each anchor. In this paper, we propose a new method of generating proposals based on a particle filter [24, 25]. First, we divide the Conv5 feature map into a 6×6 grid and generate 32×32-pixel proposals at each grid cell center. Second, according to the features of the generated proposals, we place particles around the targets via a Gaussian distribution, i.e., we put more particles z_i close to the ground truth Z and fewer particles away from it. Third, we compute and normalize the similarity w_i between the proposals \tilde{x} and the particles z_i and choose the most similar proposals. Finally, we repeat this N times to update the original proposals \tilde{x} = (\tilde{x}_1, \tilde{y}_1, \tilde{x}_2, \tilde{y}_2) into new ones x = (x_1, y_1, x_2, y_2) as follows:

w_i = p(\tilde{x} \mid z_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(\tilde{x} - z_i)^2}{2\sigma^2}\right], \quad w_i \leftarrow \frac{w_i}{\sum_{i=1}^{N} w_i}, \quad x = \sum_{i=1}^{N} w_i \tilde{x}_i    (3)
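
A minimal sketch of this resampling loop follows, under our own assumptions: the feature comparison behind the similarity is not specified, so the sketch takes a caller-supplied squared-distance function, and n_particles and sigma are placeholders (the paper only reports N = 15 and N = 40 iterations).

```python
import torch

def refine_proposal(box, sq_dist, n_particles=100, sigma=8.0, n_iters=15):
    """Particle-filter refinement of one proposal box (x1, y1, x2, y2), Eq. (3).

    sq_dist(x, z) scores a particle against the proposal; the paper compares
    object features, which we leave abstract as a caller-supplied function."""
    x = torch.as_tensor(box, dtype=torch.float64)
    for _ in range(n_iters):
        # Scatter particles around the current box (more mass near it).
        z = x + sigma * torch.randn(n_particles, 4, dtype=torch.float64)
        d2 = torch.stack([sq_dist(x, zi) for zi in z])   # (x - z_i)^2 term
        # Gaussian similarity; the 1/(sqrt(2*pi)*sigma) factor cancels
        # after normalization, so it is omitted here.
        w = torch.exp(-d2 / (2 * sigma ** 2))
        w = w / w.sum()                                  # normalize weights
        x = (w.unsqueeze(1) * z).sum(dim=0)              # weighted update
    return x

# Toy usage: with plain coordinate distance the box stays near its start.
refined = refine_proposal([10.0, 10.0, 42.0, 42.0],
                          lambda a, b: ((a - b) ** 2).sum())
```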

2.4. Overlap loss

Fig. 2: Overlap loss between a proposal and a ground truth box.

Faster R-CNN [6] describes an object bounding box with 4 coordinate variables, which are optimized independently via the smooth L1 loss [6]. In Fig. 2, the purple predicted box is the 4-dimensional vector x = (x_1, y_1, x_2, y_2) and the red ground truth box is \tilde{x} = (\tilde{x}_1, \tilde{y}_1, \tilde{x}_2, \tilde{y}_2). The coordinates are related to each other, so we regress the overlap area between the two boxes:

x_1' = \max(x_1, \tilde{x}_1), \quad y_1' = \max(y_1, \tilde{y}_1)
x_2' = \min(x_2, \tilde{x}_2), \quad y_2' = \min(y_2, \tilde{y}_2)
I = (x_2' - x_1')(y_2' - y_1')
U = (x_2 - x_1)(y_2 - y_1) + (\tilde{x}_2 - \tilde{x}_1)(\tilde{y}_2 - \tilde{y}_1) - I

\text{Overlap loss: } L = -\ln\frac{I}{U}    (4)

where I is the intersection area and U is the union area of the proposal and the ground truth box. During back-propagation, we reduce the localization loss via the SGD [19] algorithm and compute the gradients as

\frac{\partial L}{\partial x} = \frac{1}{U}\frac{\partial U}{\partial x} - \frac{1}{I}\frac{\partial I}{\partial x} = \frac{I\,\partial U / \partial x - U\,\partial I / \partial x}{U I}

where the -\frac{1}{I}\frac{\partial I}{\partial x} term drives the intersection area to grow and the \frac{1}{U}\frac{\partial U}{\partial x} term drives the union area to shrink, both of which minimize the loss. Writing X = (x_2 - x_1)(y_2 - y_1) for the predicted box's own area, the partial derivatives with respect to x (and analogously y) are:

\frac{\partial X}{\partial x_1} = y_1 - y_2, \quad \frac{\partial I}{\partial x_1} = (y_1' - y_2')\,\mathbb{1}(x_1 > \tilde{x}_1)    (5)

\frac{\partial X}{\partial x_2} = y_2 - y_1, \quad \frac{\partial I}{\partial x_2} = (y_2' - y_1')\,\mathbb{1}(x_2 < \tilde{x}_2)    (6)
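
Since Eq. (4) is differentiable, autograd reproduces the gradients of Eqs. (5)-(6) without hand-coding them. Here is a minimal PyTorch sketch; the eps term and the clamp are our additions for numerical safety, as the derivation above assumes the boxes overlap.

```python
import torch

def overlap_loss(pred, gt, eps=1e-6):
    """Overlap loss of Eq. (4): -ln(I/U) between predicted and ground-truth
    boxes, each given as (x1, y1, x2, y2) rows of an (N, 4) tensor."""
    x1 = torch.max(pred[:, 0], gt[:, 0])                     # x1'
    y1 = torch.max(pred[:, 1], gt[:, 1])                     # y1'
    x2 = torch.min(pred[:, 2], gt[:, 2])                     # x2'
    y2 = torch.min(pred[:, 3], gt[:, 3])                     # y2'
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)  # I
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_p + area_g - inter                          # U
    return -torch.log(inter / union + eps).mean()            # Eq. (4)

# Autograd yields the gradients of Eqs. (5)-(6) automatically:
pred = torch.tensor([[10., 10., 50., 60.]], requires_grad=True)
gt = torch.tensor([[12., 8., 48., 58.]])
overlap_loss(pred, gt).backward()
print(pred.grad)
```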

3. EXPERIMENTAL RESULTS

3.1. Experimental setup

3.1.1. For object detection

We evaluate Pt-Net on the PASCAL VOC 2007 [17] dataset, a standard object detection benchmark, and compare our results with state-of-the-art methods initialized from an ImageNet [1] pre-trained VGG-16 network. Fine-tuning runs for 60k mini-batch SGD iterations with a mini-batch size of 10. We use 0.9 momentum and an initial learning rate of 0.001, which is then decreased by a factor of 0.1 after a set number of iterations.
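
Read as a PyTorch training schedule, the setup looks roughly like the sketch below. The model is a stand-in for the Pt-Net backbone and the 40k-iteration milestone is our guess, since the paper only says the rate drops by 0.1 after a set number of iterations.

```python
import torch

model = torch.nn.Conv2d(3, 64, kernel_size=3)    # stand-in for the backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40_000], gamma=0.1)   # decay step is assumed

for step in range(60_000):                       # 60k fine-tuning iterations
    x = torch.randn(10, 3, 64, 64)               # mini-batch of 10 (dummy data)
    loss = model(x).pow(2).mean()                # stand-in for the detection loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```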

3.1.2. For face detection

We train Pt-Net on the WIDER FACE [26] dataset, whose training set includes 12,880 images and 159,424 faces, and evaluate it on the FDDB [27] benchmark, which contains 5,171 annotated faces in 2,845 images. A face bounding box is regarded as a true positive if its Intersection over Union (IoU) with a face ground truth is larger than 0.5.

Table 1: Detection results on the VOC 2007 test set. All methods use the 07++12 train set (the union of VOC07 trainval, VOC07 test, and VOC12 trainval) and the VGG-16 network. PT: pre-trained, PF: particle filter (C=0.5), MF: multi-feature, OL: overlap loss.

Method      mAP   aero  bike  bird  boat  bottle bus   car   cat   chair cow   table dog   horse mbike person plant sheep sofa  train tv
Faster [6]  73.2  76.5  79.0  70.9  65.5  52.1   83.1  84.7  86.4  52.0  81.9  65.7  84.8  84.6  77.5  76.7   38.8  73.6  73.9  83.0  72.6
Ours [1]    74.2  77.3  80.0  69.3  66.8  58.1   81.7  85.6  86.8  57.3  78.9  68.8  85.0  85.2  77.0  77.5   50.4  72.7  73.2  84.2  73.8
Ours [2]    73.6  77.4  78.6  70.5  65.3  55.6   81.3  83.8  88.5  55.3  79.5  65.2  85.1  84.1  78.4  74.8   49.5  72.1  72.4  82.7  71.9
Ours [3]    74.1  77.6  78.1  69.8  67.3  58.9   83.2  81.7  89.1  52.8  80.3  65.4  84.6  85.5  77.7  76.5   50.1  73.6  76.5  80.2  73.1
Ours [4]    74.2  74.6  77.9  71.3  66.6  61.3   80.1  80.3  87.6  59.3  78.9  66.9  86.4  85.8  76.6  75.3   54.6  71.1  74.8  81.7  72.9
Ours [5]    76.8  78.4  80.3  72.7  68.7  63.3   83.6  85.7  89.7  60.2  82.5  69.4  87.8  86.3  81.3  79.9   55.1  75.1  77.9  85.0  74.0

Table 2: Different proportions C of the pre-trained VGG-16 model and CNN output in the linear addition structure.

C     mAP (VOC 2007)
0     70.6
0.3   73.7
0.5   76.8
0.7   71.3
1.0   68.8

3.2. Object detection results

3.2.1. Pre-trained model parameters

Table 1 shows that Pt-Net achieves 76.8% mAP (mean Average Precision) with the four proposed methods combined, 3.6% mAP higher than Faster R-CNN [6]. Adding only the pre-trained method (C=0.5) already improves the baseline [6] by 1.0% mAP. Table 2 shows the detection results for various proportions C. C=0.5 reaches the best result, which means the pre-trained model and the network's own CNN output are both important; above or below 0.5, mAP decreases, so we choose 0.5 in our framework.

3.2.2. Overlap loss and multi-feature

Fig. 3: Bounding box regression for localization. Top: smooth L1 loss. Bottom: overlap loss.

Fig. 3 compares smooth L1 loss and overlap loss on top of Faster R-CNN [6]. From the result: (1) overlap loss handles objects of different scales, (2) overlap loss encloses objects more tightly than smooth L1 loss, and (3) overlap loss detects small objects better than smooth L1 loss, which ignores objects that are too small. In addition, overlap loss improves the detection result by 1.0% mAP and multi-feature by 0.9% mAP compared to [6] in Table 1.

3.2.3. Detection speed

With the particle filter method of generating proposals, Pt-Net achieves 13 fps (frames per second) at N=40 resampling iterations on an NVIDIA GTX 1070 GPU on the VOC 2007 test set. Fast R-CNN [5] with the SS method runs at 0.5 fps, and Faster R-CNN [6], which uses RPNs instead of SS, runs at 7 fps, so Pt-Net is faster than both.

3.3. Face detection results

Fig. 4: ROC curves of state-of-the-art face detection methods on FDDB. Left: continuous scores; right: discrete scores.

Fig. 4 compares region-based face detection methods on the FDDB [27] benchmark via ROC curves, evaluating our Pt-Net against the state-of-the-art detectors R-CNN, Fast R-CNN, and Faster R-CNN. The results show that the accuracy ranking is Pt-Net, Faster R-CNN, Fast R-CNN, R-CNN, and the speed ranking is the same. The proposed model therefore gains both accuracy and speed from the pre-trained and elaborately designed multi-feature CNN. The true positive rates for continuous and discrete scores are 0.703 and 0.856 at 1000 false positives, respectively, and Pt-Net achieves a high true positive rate at fewer than 200 false positives. Pt-Net runs at 43 fps (N=15) on VGA-resolution images on an NVIDIA GTX 1070 GPU and is a candidate for real-time face detection systems.

4. CONCLUSIONS

In this paper, we have introduced the Pt-Net detection architecture, based on a pre-trained VGG-16 network, for both object detection and face detection. Pt-Net achieves 76.8% mAP via four methods: (1) a linear combination of the pre-trained model and the CNN output, (2) multi-feature maps with concatenation across multiple layers, (3) proposal generation via a particle filter, and (4) a novel overlap-area loss function for localization. In short, Pt-Net performs well in both speed and accuracy, comparable to state-of-the-art detectors, and can be applied in many fields involving object and face detection.

5. REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[2] D. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60, no. 2, pp. 91-110, 2004.
[3] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005, vol. 1, pp. 886-893.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014, pp. 580-587.
[5] R. Girshick, "Fast R-CNN," in ICCV, 2015, pp. 1440-1448.
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS, 2015, pp. 91-99.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," arXiv preprint arXiv:1506.02640, 2015.
[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed, "SSD: Single shot multibox detector," arXiv preprint arXiv:1512.02325, 2015.
[9] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in CVPR, 2015, pp. 5325-5334.
[10] L. Huang, Y. Yang, Y. Deng, and Y. Yu, "Densebox: Unifying landmark localization with end to end object detection," arXiv preprint arXiv:1509.04874, 2015.
[11] P. Dollár and C. L. Zitnick, "Fast edge detection using structured forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 8, pp. 1558-1570, 2015.
[12] S. Yang, P. Luo, C. C. Loy, and X. Tang, "From facial parts responses to face detection: A deep learning approach," in ICCV, 2015, pp. 3676-3684.
[13] J. Hosang, M. Omran, R. Benenson, and B. Schiele, "Taking a deeper look at pedestrians," in CVPR, 2015, pp. 4073-4082.
[14] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan, "Scale-aware fast R-CNN for pedestrian detection," arXiv preprint arXiv:1510.08160, 2015.
[15] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," IJCV, vol. 104, no. 2, pp. 154-171, 2013.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in ECCV, Springer, 2014, pp. 346-361.
[17] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," IJCV, vol. 88, no. 2, pp. 303-338, 2010.
[18] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[19] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, no. 4, pp. 541-551, 1989.
[20] T. Kong, A. Yao, Y. Chen, and F. Sun, "Hypernet: Towards accurate region proposal generation and joint object detection," arXiv preprint arXiv:1604.00600, 2016.
[21] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, "Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks," arXiv preprint arXiv:1512.04143, 2015.
[22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015, pp. 1-9.
[23] W. Shang, K. Sohn, D. Almeida, and H. Lee, "Understanding and improving convolutional neural networks via concatenated rectified linear units," arXiv preprint arXiv:1603.05201, 2016.
[24] K. Nummiaro, E. Koller-Meier, and L. J. V. Gool, "Object tracking with an adaptive color-based particle filter," Lecture Notes in Computer Science, vol. 2449, pp. 353-360, 2002.
[25] E. Yang and M. Jeon, "Object tracking with the level set method and the particle filtering," Lecture Notes in Computer Science, 2009.
[26] S. Yang, P. Luo, C. C. Loy, and X. Tang, "Wider face: A face detection benchmark," in CVPR, 2016, pp. 5525-5533.
[27] V. Jain and E. G. Learned-Miller, "FDDB: A benchmark for face detection in unconstrained settings," UMass Amherst Technical Report, 2010.