Robust Object detection for tiny and dense targets in VHR Aerial Images

Haining Xie 1, Tian Wang 1, Meina Qiao 1, Mengyi Zhang 2, Guangcun Shan 1, Hichem Snoussi 3

1 School of Automation Science and Electrical Engineering, Beihang University, Beijing, China
2 College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing, China
3 Institute Charles Delaunay-LM2S-UMR STMR 6279 CNRS, University of Technology of Troyes, France

xiehaining@buaa.edu.cn, wangtian@buaa.edu.cn, meinaqiao@buaa.edu.cn, myzhang@njtech.edu.cn, gcshan@buaa.edu.cn, hichem.snoussi@utt.fr

Abstract

Object detection in Very High Resolution (VHR) optical remote sensing images is challenging because the objects are usually dense and tiny. Random orientations, varied backgrounds, and unpredictable noise make traditional image processing methods perform badly. In this paper, we propose using state-of-the-art Region-based Fully Convolutional Networks to solve object detection tasks in aerial images. To make the whole system efficient, we use position-sensitive score maps, which not only take full advantage of the convolutional feature maps but also balance translation variance in object detection against translation invariance in classification. In addition, with a 101-layer residual network as the feature extractor, we achieve a satisfying result that is low in time consumption and shows ... percent and ... percent precision, respectively, on two datasets.

Index Terms: Object Detection, Convolutional Neural Network, NWPU VHR-10 dataset, Pri-SDL dataset

I. INTRODUCTION

With the excellent abstraction ability of deep convolutional neural networks, related methods have been tried on remote sensing images to help solve object detection and classification problems. In addition, over the past few years, a nearly complete framework of object detection for common scenes has been developed.
The first of the following parts summarizes the major progress of object detection for common scenes. The second part focuses on the latest research on object detection in aerial images.

A. Object detection models in common images

Different convolutional neural networks such as AlexNet [1], VGGNet [2], and GoogLeNet [3] have been proposed and have achieved outstanding results in classification, even better than those of human experts. R-CNN [4], the original object detection model of a prevalent family [4] [5] [6], used those excellent classification networks as its fundamental feature extractors. Although it made great progress in object detection, it was slow because of repeated computation and was hard to train because of its complicated components. The fixed image size imposed by the fully connected layers was another annoying limit. In this version, selective search [7] was chosen to propose regions of interest (ROIs) from among several recently proposed methods such as objectness [8], multiscale combinatorial grouping [9], and category-independent object proposals [10].

The successors of R-CNN are SPP net [11] and Fast R-CNN [5]. SPP net suggested spatial pyramid pooling to remove the limit of fixed image size, while Fast R-CNN proposed ROI max pooling. In general, the ROI max pooling layer is a particular case of the spatial pyramid pooling layer. Fast R-CNN computed the convolutional layers once for the whole image and shared the result as its feature map, which reduced superfluous computation and made training 9× faster than R-CNN. With all these improvements, Fast R-CNN became a nearly end-to-end neural network, which meant it could be trained more easily.

The latest version is Faster R-CNN [6], which introduced the region proposal network, a small fully convolutional network merged with the whole network to propose ROIs more efficiently. Another creative idea was anchors, which can be thought of as a regression method of spatial pyramid pooling [6].
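The anchor idea mentioned above can be made concrete with a minimal NumPy sketch. It assumes the commonly cited Faster R-CNN defaults of three scales (128, 256, 512 pixels) and three aspect ratios (0.5, 1, 2) on a feature map with stride 16. The function name `generate_anchors`, its parameters, and the cell-centre convention are illustrative choices, not the authors' implementation.

```python
import numpy as np


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as a (feat_h * feat_w * A, 4) array of x1, y1, x2, y2,
    where A = len(scales) * len(ratios)."""
    # One (width, height) pair per scale/ratio combination.
    # ratio is height / width, and each box keeps an area of scale ** 2.
    sizes = np.array([(s / np.sqrt(r), s * np.sqrt(r))
                      for s in scales for r in ratios])        # (A, 2)

    # Assumed convention: anchor centres sit at the middle of each
    # feature-map cell, expressed in input-image pixels.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                                # (feat_h, feat_w) each
    centres = np.stack([cx.ravel(), cy.ravel()], axis=1)        # (H*W, 2)

    half = sizes / 2.0
    top_left = centres[:, None, :] - half[None, :, :]           # (H*W, A, 2)
    bottom_right = centres[:, None, :] + half[None, :, :]
    return np.concatenate([top_left, bottom_right], axis=2).reshape(-1, 4)


anchors = generate_anchors(feat_h=2, feat_w=3)
widths = anchors[:, 2] - anchors[:, 0]
heights = anchors[:, 3] - anchors[:, 1]
```

For tiny targets, one could pass smaller values such as `scales=(16, 32, 64)`. These numbers are purely illustrative and are not taken from the paper.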
Together, these components form a unified convolutional network that can be trained and used efficiently. In this paper, we propose using the Region-based Fully Convolutional Network, an improvement on this prevalent family inspired by recent semantic segmentation work [12], for object detection in VHR aerial images. It introduces position-sensitive score maps [13] to reconcile the basic conflict in object detection between translation variance for localization and translation invariance for classification.

B. Object detection in VHR optical remote sensing images

Many traditional methods have been developed to detect tiny and dense objects in aerial images. Several hand-crafted features such as HOG [14], Haar-like [15], and LBP [16] were used to represent objects. These approaches were easy to understand but remained at a primitive stage and usually had to be changed for specific environments. They also did not give satisfying results and cost a lot of computational resources, because multiple oriented channels were usually used to merge results. Earlier researchers worked hard to find invariant features and specific transformations by hand, whereas convolutional and pooling layers now show robust abstraction ability. More recent research therefore used basic convolutional neural networks for object detection. The paper [17] used AlexNet [1] as its feature extractor and followed a coarse-localization, fine-classification pipeline, which was a particular adaptation of the commonly used Selective Search method. The paper [18] chose nearly the same settings but proposed a new rotation-invariant layer, which just optimized the multinomial logistic regression objective. However, that work focused on rotation invariance, which common convolutional neural networks already handle. Several other neural network methods were then introduced, mainly to solve classification problems. We find that all these methods used on aerial images are primitive and computationally expensive. They were not end-to-end, so training was complex and difficult. Moreover, the particular differences between aerial images and images of common scenes were not fully discussed, which might lead to biased approaches.

C. Proposal

We first compared SSD [19] and R-FCN [13] on aerial images and chose R-FCN for its accuracy and expandability. We propose using Region-based Fully Convolutional Networks [13] as an elegant, end-to-end framework and suggest different parameter fine-tuning methods for various conditions. After a comprehensive comparative analysis, we achieve a more accurate result that adapts to scenes with dense and tiny objects.

II. OUR APPROACH

Figure 1 illustrates the basic framework of our structure. It contains two major parts: a residual neural network used as the feature extractor, followed by a popular two-stage object detection pipeline. The detection pipeline consists of two stages: one proposes ROIs (regions of interest), and the other classifies them.

Fig. 1: Basic framework of the Region-based Fully Convolutional Network. First, images are convolved by ResNet-101 to generate feature maps. The feature maps are then convolved by two separate networks: the RPN (Region Proposal Network) and the position-sensitive score maps. When the ROIs proposed by the RPN are pooled, k × k bins are used to vote.

A. Data Preparation

A large dataset is usually necessary to train a neural network, and data augmentation methods such as mirroring, rotating, or scaling the image are commonly used. We find that using a pretrained model (at first we used a model trained on a dataset of pets) can significantly reduce the need for extra data; a dataset with only 100 images can also perform well in our tests.

B. Residual Neural Network

As the backbone architecture, ResNet-101 [20] is the basis of R-FCN; it is chosen because it performs excellently in the ImageNet classification competitions. Like GoogLeNet [3], ResNet-101 is fully convolutional by design. The fully convolutional structure, unlike a fully connected one, removes the limit of fixed image size, so the annoying image-resizing issues do not arise. ResNet-101 has 100 convolutional layers, followed by global average pooling (instead of the commonly used max pooling layers) and a C-class fc layer [20], where C is the number of classes. In R-FCN, however, as the successor of the prevalent R-CNN family, only the convolutional layers are kept to generate feature maps.

To reduce training time, we take advantage of transfer learning, which shows that even an unrelated pretrained model can be better than randomly initialized parameters. Our model is pre-trained on the COCO dataset. The original ResNet ends with a 2048-d convolutional block; to reduce the dimension, a randomly initialized 1024-d 1 × 1 convolutional layer is attached at the end. Then a k^2 (C + 1)-channel layer is applied to generate the position-sensitive score maps.

C. Score maps and pooling layers

Every ROI rectangle is divided into k × k bins to encode position information. Fig. 1 shows this structure with k = 3. Pooling uses the following formula.
$$ r_c(i, j \mid \theta) = \sum_{(x, y) \in \mathrm{bin}(i, j)} z_{i,j,c}(x + x_0,\; y + y_0 \mid \theta) \,/\, n \qquad (1) $$

In the formula, r_c(i, j) is the pooled response of the (i, j)-th bin for the c-th class. Following [13], z_{i,j,c} is one of the k^2 (C + 1) score maps, (x_0, y_0) is the top-left corner of the ROI, n is the number of pixels in the bin, and θ denotes all learnable parameters of the network.

D. Adaption to dense and tiny objects

This model performs well on images of common scenes, where objects usually occupy large areas. In VHR aerial images, however, objects such as planes or cars are tiny and randomly distributed. They are also often densely packed, which makes the task challenging. We propose that the size of the input feature maps and the size of the anchors be fine-tuned to adapt to VHR aerial images, especially for dense and tiny targets.

III. EXPERIMENTS

We evaluate the model on two publicly available plane datasets: the Pri-SDL dataset [17] and the NWPU VHR-10 dataset [18]. The first dataset is divided into two subsets: one contains 500 images for training, and the other contains 100 images for evaluation. The second dataset is used entirely for evaluation.
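Returning to the pooling rule of Section II-C, Eq. (1) can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' code: the function name `ps_roi_pool`, the channel layout of the score maps, the integer ROI given in feature-map pixels, and the choice to vote by averaging the k × k bin responses (the simple averaging vote described for R-FCN in [13]) are all assumptions made for the example.

```python
import numpy as np


def ps_roi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive ROI pooling following Eq. (1).

    score_maps: array of shape (k * k * (num_classes + 1), H, W)
    roi:        (x0, y0, w, h) in feature-map pixels, assumed to lie inside the map
    Returns (r, scores): r has shape (num_classes + 1, k, k) and
    scores has shape (num_classes + 1,).
    """
    c1 = num_classes + 1                      # C object classes plus background
    x0, y0, w, h = roi
    height, width = score_maps.shape[1:]
    # Assumed channel layout: channel (i * k + j) * (C + 1) + c holds z_{i,j,c}.
    z = score_maps.reshape(k, k, c1, height, width)

    r = np.zeros((c1, k, k))
    for i in range(k):                        # bin index along x
        for j in range(k):                    # bin index along y
            xs = x0 + int(np.floor(i * w / k))
            xe = x0 + int(np.ceil((i + 1) * w / k))
            ys = y0 + int(np.floor(j * h / k))
            ye = y0 + int(np.ceil((j + 1) * h / k))
            # Average z_{i,j,c} over bin(i, j): the sum in Eq. (1) divided by n.
            r[:, i, j] = z[i, j, :, ys:ye, xs:xe].mean(axis=(1, 2))

    # The k * k bins vote; here the vote is a plain average per class.
    scores = r.mean(axis=(1, 2))
    return r, scores


k, num_classes = 3, 1                         # k = 3 as in Fig. 1; one class (plane)
score_maps = np.random.rand(k * k * (num_classes + 1), 12, 12)
r, scores = ps_roi_pool(score_maps, roi=(2, 3, 6, 6), k=k, num_classes=num_classes)
```

Each bin reads only from its own group of score maps, which is what lets the shared convolutional features encode position inside the ROI without any per-ROI layers.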

A. Details of settings

Several parameters are set manually before training, based on experience. Because we use transfer learning from a model pre-trained on an unrelated dataset, we find that setting the initial learning rate at ... performs well. The anchor scales and aspect ratios use their default values. The height stride and width stride are both set to 16. Training took 2 to 3 hours on a GTX 1060 with 6 GB of memory. All the following experiments use this trained model.

B. Performance on Pri-SDL dataset

According to [17], the dataset contains 600 images with 3210 plane samples in total, collected from Google Earth. The evaluation standard is the same as the PASCAL VOC object detection evaluation protocol [21], under which a detected bounding box is considered matched with the ground truth when the overlapping area is larger than 50%. Fig. 2 shows the precision-versus-recall curves compared with the results given by [17]. Setting the recall rate at 0.9, we compare precision and present the result in Table I.

Fig. 2: Precision-versus-recall curves

TABLE I: Performance when setting recall at 0.9
Feature    ACF    FC6+FC7    POOL5+FC6+FC7    R-FCN
Plane

It can be seen that both DCNN and R-FCN outperform the ACF method, and that R-FCN is more robust than DCNN, keeping a precision rate of 0.9441 when recall reaches 0.9723. It can also be inferred that R-FCN saves time, because DCNN does not share convolutional features and performs repeated computation. Fig. 3 shows some final examples.

Fig. 3: The final results of Pri-SDL dataset.

C. Evaluation on NWPU VHR-10 dataset

In total there are 80 images with 757 airplane samples in the NWPU VHR-10 dataset, with spatial resolution ranging from 0.5 to 2 m. We use this whole dataset only for evaluation. The R-FCN model trained on the dataset of [17] is used with no further adaptation, which also demonstrates its robustness. The precision-recall curve and Average Precision (AP), both of which are standard and widely used, serve for evaluation and comparison. The average running time per image is then compared to show the efficiency of R-FCN on aerial images.

TABLE II: Performance comparisons of AP values
BoW    SSCBoW    FDDL    COPD    Transferred CNN    RICNN    R-FCN

Fig. 5 shows that RICNN with fine-tuning performs significantly better than previous methods such as COPD. However, the precision-recall curve of R-FCN lies above that of RICNN and keeps a precision rate of ... when recall is at ... This shows that R-FCN is more robust than RICNN in detecting airplanes in VHR aerial images. Fig. 6 shows the results generated by our trained model. Table II compares the AP values of different algorithms, while Fig. 4 presents the average running time per image. All these show that R-FCN, used to detect airplanes in VHR aerial images, is not only more robust and accurate but also more time-saving.

Fig. 4: Time consumption of different algorithms

Fig. 5: Precision-Recall curve (plot title: Precision versus recall curves of plane detection; legend: R-FCN, RICNN, COPD; axes: Precision, Recall)

Fig. 6: The results of detecting planes in NWPU VHR-10 dataset

D. Dense and tiny targets

The challenging part of object detection in VHR aerial images is that the targets are usually denser and tinier than in common scenes. We find that our approach adapts well to these scenes after the relevant parameters are fine-tuned. Although R-FCN removes the limit of fixed image size, it can still be affected by the proportion between the target and the whole scene. By fine-tuning the size of the feature extraction maps and anchors, the R-FCN model performs well in scenes with even denser and tinier targets. Fig. 7 shows some examples.

Fig. 7: The results of detecting dense and tiny planes.

IV. CONCLUSION

In this paper, we propose using the object detection model R-FCN on VHR aerial images. We show that pre-trained models, even from an unrelated dataset, may reduce training time dramatically. After comparison with other detection algorithms used on aerial images on different datasets, the results show that R-FCN is more robust and gains a higher AP score. We also fine-tune the size of the feature extraction maps and anchors when facing denser and tinier targets, which also yields satisfying results.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks."
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition."
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," pp. 1-9.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation."
[5] R. Girshick, "Fast R-CNN."
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1.
[7] J. Uijlings, K. E. Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2.
[8] B. Alexe, T. Deselaers, and V. Ferrari, "Measuring the objectness of image windows," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11.
[9] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping."
[10] I. Endres and D. Hoiem, "Category independent object proposals."
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9.
[12] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation."
[13] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," 2016.
[14] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," vol. 1, pp. 886-893, 2005.
[15] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," vol. 1.
[16] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7.
[17] H. Q. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao, "Orientation robust object detection in aerial images using deep convolutional neural network."
[18] G. Cheng, P. Zhou, and J. Han, "Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 12.
[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg, "SSD: Single shot multibox detector."
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition."
[21] M. Everingham, S. M. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge: A retrospective," International Journal of Computer Vision, vol. 111, no. 1.
