Automatic detection of books based on Faster R-CNN

Beibei Zhu, Xiaoyu Wu, Lei Yang, Yinghua Shen
School of Information Engineering, Communication University of China, Beijing, China
e-mail: zhubeibei@cuc.edu.cn, wuxiaoyu@cuc.edu.cn, young-lad@263.net, shenyinghua@cuc.edu.cn

Abstract
Detection networks have improved continuously, as with SPPnet and Fast R-CNN. Recently the novel region proposal method RPN, which shares full-image convolutional features with the detection network, has enabled the state-of-the-art object detection network Faster R-CNN. In this work we apply Faster R-CNN to train a detection network on our digital image database of books and implement automatic recognition and positioning of books. Experiments show that the retrained Faster R-CNN achieves fine detection results in terms of both speed and accuracy, and it also solves the problem of testing negative examples in our previous study. This provides great help for the study of practical book retrieval systems.

Keywords: object detection; detection of books; Faster R-CNN; deep learning

I. INTRODUCTION
Nowadays new technology and networks have made our lives easier and more convenient. The intelligent retrieval and management of books has developed for a long time and now has wide prospects given the latest deep learning methods. The identification of books is the key step of a retrieval system. So far book identification has mostly used text-based methods or content-based shallow machine learning methods that require manually extracted features. Hao Wang and Peng Ye et al. have respectively realized the automatic classification of Chinese books and journal articles based on support vector machine (SVM) and back-propagation neural network algorithms [1][2]. There are also numerous studies of the text categorization of books based on machine learning.
Deep learning has been a hotspot in recent years and has obtained good application results in image classification, scene recognition, and object detection and tracking. There are few studies on the identification of books based on deep learning. We have studied the image identification of 10 classes of books based on SVM and deep learning methods [3]. The experimental test results are good, but the system is unfit for testing images of other books that don't belong to those 10 classes or images of other objects. Lately, region-based convolutional neural networks have made good advances in object detection. The computational cost of the bottleneck in state-of-the-art detection systems, that is, the proposal step, has been drastically reduced by the novel network Faster R-CNN [4], which enables a fast and effective object detection system. With the very deep VGG-16 model, the detection system runs at 5 fps on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 and 2012. This paper studies the image detection of books based on deep learning: we apply Faster R-CNN to recognize 10 classes of books in images and predict their bounding boxes. Experiments show that the detection network retrained on our database achieves good detection accuracy, which also lays the foundation for an integrated intelligent book retrieval system.

II. THE DETECTION OF BOOKS BASED ON FASTER R-CNN
Faster R-CNN is a region-based object detection network proposed by Shaoqing Ren et al. Object detection networks depend on region proposal methods to hypothesize object locations; the original methods typically rely on inexpensive features and are implemented on the CPU, which makes the running time of the proposal step intolerable. Faster R-CNN was proposed as an elegant and effective solution to this problem. Faster R-CNN includes a Region Proposal Network (RPN) and an object detection network.
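The two-stage design just described, one shared feature map, proposals from the RPN, then per-proposal classification, can be summarized schematically. This is a sketch of the control flow only, not the authors' code; all three callables are placeholders for the trained sub-networks:

```python
def detect(image, shared_conv, rpn, detector):
    """Schematic of the two-stage Faster R-CNN pipeline.
    shared_conv computes the full-image feature map once; rpn proposes
    candidate boxes from it; detector classifies/refines each proposal."""
    features = shared_conv(image)      # convolutional features, computed once
    proposals = rpn(features)          # class-agnostic region proposals
    return [detector(features, p) for p in proposals]
```

The design point is that `shared_conv` runs once per image, so proposal generation adds almost no cost on top of detection.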
Shaoqing Ren et al. introduce the RPN into a convolutional neural network and train an end-to-end network to enable nearly cost-free region proposals. The architecture of Faster R-CNN is shown in Figure 1. The top branch is Fast R-CNN [5] and the bottom branch is the RPN. Note that Figure 1 just shows how the two networks share convolutional layers; their structural similarity is the foundation of the network convergence in the training scheme, and the ultimate model of Faster R-CNN is the upper branch. The intermediate layer plays the same role as ROI (Regions of Interest) pooling and pools a small window of the convolution map to a fixed-sized vector. The fc layers in the network refer to fully connected layers. In the training phase, the RPN takes an image of any size as input and outputs a set of rectangular object proposals, each with 4 coordinates of a predicted object bounding box and 2 scores that estimate the probability of object/not-object. By minimizing the classification loss and regression loss for learning region proposals, the RPN is trained to generate high-quality proposals. The object detection network is Fast R-CNN; it takes the proposals generated by the RPN as input and performs elaborate classification and positioning by calculating softmax probabilities and bounding-box regression offsets for each proposal. The training scheme of the network is ingenious. In the first two steps, the RPN and Fast R-CNN are trained independently. In the last two steps, the two networks are trained alternately to share convolutional layers and fine-tune their respective fully connected layers for the ultimate model.
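The 4 predicted coordinates per proposal are learned not as raw pixel values but as offsets relative to a reference box. A minimal sketch of the center-size parameterization used in the Faster R-CNN paper [4] (function and variable names are ours):

```python
import math

def encode(box, anchor):
    """Regression targets for a box relative to a reference (anchor) box.
    Both arguments are (cx, cy, w, h): center coordinates, width, height."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def decode(offsets, anchor):
    """Invert encode(): apply predicted offsets to a reference box."""
    tx, ty, tw, th = offsets
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya,
            wa * math.exp(tw), ha * math.exp(th))
```

The regression loss is computed on these offsets, which keeps targets roughly scale-invariant across boxes of different sizes.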

In the test phase, the re-trained model takes test images as input and outputs the predicted category label and the bounding box for each corresponding target object.

Figure 1. The architecture of Faster R-CNN.

A. Database preparation
In this paper we train a network on our database of books to identify 10 classes of books and simultaneously predict their object bounds. We use the same image database as in paper [3]. Figure 2 shows some examples from the database. The database contains more than 4000 images of 10 classes of books taken against different natural backgrounds under 4 lighting conditions. The 4 lighting conditions are outdoor natural light at noon, outdoor natural light at dusk, indoor lamplight, and indoor natural light at dusk. Considering that different users may have different shooting habits, the images were taken by different users with their own equipment to simulate actual shooting. For more details about the construction of the database please refer to paper [3]. We set the labels of the books to 0~9 and randomly select 1500 images for training, 1500 images for validation and 1000 images for testing, that is, 150 training images, 150 validation images and 100 test images per class. We re-scale the images, with reference to the datasets from the PASCAL VOC challenges, such that their longer side is 500 pixels while preserving their aspect ratios.

Figure 2. Examples from the image database: (a) the 4 lighting conditions; (b) the 10 classes of books.

B. Documents preparation
In addition to the image database, there are two kinds of documents we need to prepare in order to use the Faster R-CNN framework to train the network on our database.
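The re-scaling rule used for the database in Section A (longer side 500 pixels, aspect ratio preserved) can be sketched as follows; `rescale_dims` is a hypothetical helper for illustration, not part of the authors' code:

```python
def rescale_dims(width, height, target_long=500):
    """New (width, height) such that the longer side is target_long pixels
    and the aspect ratio is preserved."""
    scale = target_long / max(width, height)
    return round(width * scale), round(height * scale)

# e.g. a 1000x750 photo becomes 500x375
```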
The localization task of the network requires a set of images that come with manual annotations indicating the ground-truth locations of books within the images. We adopt the graphical image annotation tool LabelImg created by tzutalin [6] to do the annotation work. LabelImg is written in Python and uses Qt for its graphical interface. We first predefine the classes of books as 0~9 in LabelImg. Annotation consists of drawing a tight rectangular region of interest, also called a bounding box, around each book in an image, and then adding a label for it. Examples of annotation are shown in Figure 3. LabelImg records mouse clicks and saves the coordinates and label of each bounding box in an annotation file; each image corresponds to one annotation file. The annotation file also includes information such as the name and size of the image and its number of channels. The annotation file is saved as an XML file in the same format as that adopted in the PASCAL VOC challenges.
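A PASCAL VOC-style annotation can be built with the standard library; the sketch below covers only the fields mentioned above (filename, image size, channels, and one `object` entry per bounding box) and is an illustration of the format, not LabelImg's actual output code:

```python
import xml.etree.ElementTree as ET

def voc_annotation(filename, width, height, depth, objects):
    """Build a PASCAL VOC-style annotation tree (minimal subset of fields).
    objects: list of (label, xmin, ymin, xmax, ymax) bounding boxes."""
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = filename
    size = ET.SubElement(ann, "size")
    for tag, value in (("width", width), ("height", height), ("depth", depth)):
        ET.SubElement(size, tag).text = str(value)
    for label, xmin, ymin, xmax, ymax in objects:
        obj = ET.SubElement(ann, "object")
        ET.SubElement(obj, "name").text = str(label)
        box = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", xmin), ("ymin", ymin),
                           ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(box, tag).text = str(value)
    return ann
```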

Figure 3. Examples of annotation.

Apart from annotation files, the code framework of Faster R-CNN requires 4 TXT files for each class of books and for the whole database under the /datasets/VOCdevkit2007/VOC2007/ImageSets/Main directory. These files are named train.txt, test.txt, val.txt and trainval.txt respectively. Just as their names imply, these .txt files indicate the use of the images by listing their names with different numbers as marks according to their usage. For example, in the train.txt that corresponds to a class, the images of that class used for training are marked 1 behind their names, while the other images are marked -1. In this paper, we write a simple MATLAB program to generate these files.

C. Implementation details
In this paper, we adopt the open source Faster R-CNN framework created by Shaoqing Ren et al. [7] to implement the detection of books. The official code is written in MATLAB, while a Python reimplementation of the MATLAB code is also available on GitHub. In this work we use the MATLAB version of the code since we are using the Windows 7 64-bit operating system. Our graphics card is an NVIDIA Quadro K2200 with 4 GB of GDDR5 GPU memory. The framework also needs the support of Caffe [8], into which data about proposals and labels are fed through the MATLAB interface to perform calculation and weight updates. We download the ready-made mex file compiled from Caffe, provided in the code repository under the /external/caffe directory by the developers of Faster R-CNN.

1) RPN training
In the first step, we download an ImageNet-pre-trained ZF [9] net to initialize the RPN. The ImageNet-trained ZF model is an 8-layer convnet and generalizes well to other datasets. The RPN is fine-tuned end-to-end for the region proposal task. The architecture and training process of the RPN are shown in Figure 4.
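The RPN in Figure 4 scores k anchor boxes at each sliding-window position. Assuming the defaults from the Faster R-CNN paper (3 scales and 3 aspect ratios, so k = 9), the anchor shapes can be sketched as follows; the rounding and function name are ours:

```python
def anchor_sizes(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """(width, height) of the k = len(scales) * len(ratios) anchor boxes.
    Each anchor has area scale**2 and height/width ratio `ratio`."""
    sizes = []
    for scale in scales:
        area = scale * scale
        for ratio in ratios:
            w = round((area / ratio) ** 0.5)
            h = round(w * ratio)
            sizes.append((w, h))
    return sizes
```

At training time, each of these shapes is centered at every sliding-window position to form the candidate boxes that the next paragraph samples from.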
In the RPN, we set one image per mini-batch, whose data is fed into Caffe through the MATLAB interface to perform forward propagation and back propagation each time. We randomly sample 256 anchor boxes in an image, with a 1:1 ratio of positive to negative anchors. We set the overlap threshold for an anchor box to be considered foreground to 0.7. An anchor box that has an overlap lower than 0.3 with every ground-truth box is considered a negative example. The ground-truth label is 1 for a positive anchor and 0 for a negative one. These examples, with their labels and the coordinates of the ground-truth boxes, are used for supervised training of the RPN. Note that negative anchors do not contribute to the regression loss at this stage. After the training of the RPN, we input test images into the fine-tuned RPN and output a set of predicted proposal boxes, each with 2 scores that estimate the probability of object/not-object and 4 coordinates. Because we adopt non-maximum suppression (NMS) on the proposal boxes based on their scores, the number of proposals is reduced to about 2k per image.

Figure 4. The architecture and training process of the RPN: (a) architecture; (b) training process.

2) Fast R-CNN training
In the second step, we use the proposals generated above to train a separate detection network, Fast R-CNN. The Fast R-CNN is also initialized with the ImageNet-pre-trained ZF model. In this step we set 2 images per mini-batch. For each image of a mini-batch we randomly select 64 proposals that include 16

positive examples and 48 negative examples. Unlike the RPN, we set the overlap threshold for a proposal to be considered positive to 0.5, and the rest are background examples. The ground-truth labels of positive examples are their class labels and those of negative examples are 0. Likewise, we pass the data to Caffe through the MATLAB interface to train Fast R-CNN by back-propagation and stochastic gradient descent (SGD).

3) Network convergence
In the third step, we use Fast R-CNN to initialize the RPN and fix the convolutional layers while fine-tuning the layers unique to the RPN using the training samples. In the end, we use the region proposals generated in step 3 to fine-tune the fully connected layers of Fast R-CNN while keeping the shared convolutional layers fixed. At this point the two networks share the same convolutional layers and form a unified network.

D. Experimental Results and Analysis
1) Comparison and analysis of two networks
The test results using Faster R-CNN retrained on our database of 10 classes of books are shown in Table 1. The database includes 3000 images for training and validation and 1000 images for testing. We evaluate the performance of the network in two ways: the accuracy of classification and the accuracy of bounding box prediction, also called regression accuracy.

TABLE 1. DETECTION RESULTS ON OUR DATABASE USING FASTER R-CNN
label                    0     1     2     3     4     5     6     7     8     9     mAP (%)
classification accuracy  0.99  1     0.99  0.91  0.99  0.99  1     0.99  1     1     98.6
regression accuracy      0.99  0.99  0.99  0.98  0.99  0.99  0.99  0.99  0.99  0.99  98.9
Test time (s): 190.85

We can see that the mean Average Precision (mAP) of the classification is up to 98.6%, which outperforms the network we used in paper [3]. Our system takes 190.85 seconds to test 1000 images. In paper [3] we adopted Caffe to retrain a CNN model with 3 convolutional layers and 2 fully connected layers. We used the same database with 4000 training images and 400 test images, and we rescaled and cropped the images to 32*32 and 96*96 resolution respectively.
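Returning to the sampling rules in Sections C.1 and C.2: foreground/background assignment reduces to an intersection-over-union (IoU) test against the ground-truth boxes. A simplified sketch with the RPN thresholds of 0.7 and 0.3 (the paper additionally marks the highest-overlap anchor for each ground-truth box as positive, which this sketch omits; names are ours):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """1 = foreground, 0 = background, None = ignored during training."""
    best = max((iou(anchor, g) for g in gt_boxes), default=0.0)
    if best >= pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return None
```

The same `iou` test with a single 0.5 threshold gives the Fast R-CNN positive/background split of Section C.2.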
The results of book recognition are shown in Table 2.

TABLE 2. THE ACCURACY OF BOOK RECOGNITION BASED ON CNN
Resolution \ Number   4000      8000
32*32                 97.36%    97.53%
96*96                 96.60%    97.79%

The network in this paper uses 5 convolutional layers to calculate the convolution maps of images, and the resolution of the input is much higher than that in paper [3]. The RPN also helps increase the number of training examples. All these factors contribute to the improvement in recognition accuracy. We should also be alert that, for a network of a certain depth, increasing the resolution of the input may lead to a loss in accuracy, since it brings in too many parameters.

2) Tests on negative examples
The recognition rate in paper [3] is quite good, since our classification task is relatively simple for deep learning methods. However there is a significant problem: any test image that does not include a target object will also be classified, wrongly, as one of the 10 classes. Faster R-CNN solves this problem easily by introducing a class named background. The network randomly selects patches from the background of images in the training phase and uses them as negative samples to train the network. Thus when we use the retrained network to test images that do not include target objects, the network classifies them as background. We ran a test on 300 negative examples using our retrained network. The negative examples are 150 natural images selected randomly from the PASCAL VOC datasets and 150 images of another 10 classes of books taken in similar circumstances to our training examples. The test results on images from the VOC datasets are quite good and the detection accuracy is 1, which means the network detects no book in those natural images. However, the results on the other 10 classes of negative examples are less accurate. The diagram of testing negative examples is shown in Figure 5.
We can see from Figure 5(a) that the book cover on the right is much like our positive examples labeled 7, so the error recognition rate is relatively high for this specific kind of negative example. The system also shows a relatively high error recognition rate for a few examples that have simple cover designs and are similar to our positive examples labeled 3, as shown in Figure 5(b). Apart from these, the network performs fine and barely makes mistakes. The results suggest that we need to work on our database, since the images contain little background that does not contain books, and that background lacks diversity. Therefore the negative examples don't provide fully effective information for the network's learning, which is important for open-set testing. The network performs well on the datasets for now, but we need to expand the datasets in capacity and diversity when the detection task becomes more complex.

Figure 5. The diagram of testing negative examples.

III. CONCLUSION
This paper studies the latest object detection network, Faster R-CNN, and adopts the code framework created by its authors to implement efficient and accurate detection of books. We improve the classification accuracy of books and solve the problem of testing negative samples that existed in our previous study. In further study, we may consider increasing the capacity and diversity of the database, and using deeper networks to train a more complex detection model suited to practical application.

ACKNOWLEDGMENT
This paper is under the financial aid of the National Key Technology R&D Program (2015BAK22B02 and 2014BAH10F02).

REFERENCES
[1] Hao Wang, Ming Yan and Xinning Su, "The automatic classification of Chinese book titles based on machine learning", Journal of Library Science in China, Vol. 36, No. 190, pp. 28-39, 2010.
[2] Peng Ye, "The automatic classification of Chinese journal articles based on machine learning", Nanjing University, 2013.
[3] Beibei Zhu, Lei Yang, Xiaoyu Wu and Tianchu Guo, "Automatic Recognition of Books Based on Machine Learning", International Symposium on Computational and Business Intelligence (ISCBI), pp. 74-78, 2015.
[4] Shaoqing Ren, Kaiming He, Ross Girshick and Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", Neural Information Processing Systems (NIPS), 2015.
[5] Ross Girshick, "Fast R-CNN", International Conference on Computer Vision (ICCV), 2015.
[6] Tzutalin, LabelImg: a graphical image annotation tool. https://github.com/tzutalin/labelimg.
[7] Shaoqing Ren, Kaiming He, Ross Girshick and Jian Sun, Faster R-CNN: https://github.com/shaoqingren/faster_rcnn.
[8] Yangqing Jia, Caffe: An Open Source Convolutional Architecture for Fast Feature Embedding. http://caffe.berkeleyvision.org (2013).
[9] Matthew D. Zeiler and Rob Fergus, "Visualizing and Understanding Convolutional Networks", European Conference on Computer Vision (ECCV), Vol. 8689, pp. 818-833, 2014.

AUTHORS' BACKGROUND
Name           Title                 Research Field              E-mail
Beibei Zhu     master student        Image processing            zhubeibei@cuc.edu.cn
Xiaoyu Wu      associate professor   Image processing            wuxiaoyu@cuc.edu.cn
Lei Yang       full professor        Digital media technology    young-lad@263.net
Yinghua Shen   associate professor   Digital media technology    shenyinghua@cuc.edu.cn