A Novel Representation and Pipeline for Object Detection

Vishakh Hegde, Stanford University, vishakh@stanford.edu
Manik Dhar, Stanford University, dmanik@stanford.edu

Abstract

Object detection is an important problem in computer vision research. Neural network based models have not reached the level of performance on detection that they have on object classification, an intimately related task. These methods usually treat the background as just another class for an object classifier, which fails to exploit how different background is from objects. We propose a novel training criterion which tackles background separately. At the same time, we examine how Learning without Forgetting and finetuning perform in transferring from the classification task to the detection task. We train on the canonical PASCAL VOC dataset. We provide results for a small network trained from scratch, and for a larger network pre-trained on ImageNet and then finetuned for object detection with and without Learning without Forgetting.

1. Introduction

Image perception, the ability to understand the contents of an image, is the holy grail of computer vision and artificial intelligence. Image classification was very hard for computers until recently; with the advent of deep convolutional neural networks, computers can now beat humans on image classification, at least on large-scale datasets like ImageNet. A related task, object detection, aims to both localize and classify objects in an image. Good object detection models matter in a variety of applications, including medical imaging, surveillance and object tracking. Current state-of-the-art object detection models such as Faster R-CNN [7] and SPP-Net [5] re-factor existing classification models like AlexNet and ZF-Net to suit the requirements of object detection.

The part of an image which does not contain an object is generally called background. Distinguishing background from objects is crucial, since the bulk of most natural images is background. Most state-of-the-art object recognition algorithms nevertheless treat background as just another category alongside the object categories. Classification is usually performed on a region of the image large enough to hold an object completely, but small enough to exclude most of the background. Treating background as another category does not make intuitive sense: background is present in essentially every natural image, while any given object is not. Moreover, background crops can have very large intra-class variation, while most object classes vary far less. Therefore, the representation used for classification should not simply be reused for object detection. In this work, we place special emphasis on background and design loss functions that force a neural network to activate only for objects and not at all for non-objects (background). This translates to learning a better representation specific to object detection.

2. Previous work

Object detection is a much harder problem than object classification: the object must be localized within a region as well as identified.
R-CNN was one of the earlier attempts: it finds region proposals with a method like selective search and then uses a convolutional neural network to extract features from each proposal for detection and classification [4]. Around the same time, an approach framing localization as a regression problem was proposed [9], but it did not perform as well as R-CNN. R-CNN is slow to train and test and consumes a lot of disk space, and variants were proposed to reduce training and testing time; SPP-Net [5] and Fast-RCNN [3] are the two most notable. Faster-RCNN [8] uses a region proposal network to produce object proposals, leading to an end-to-end trainable system for object detection. The methods mentioned train a new layer on top of the fc7 layer of AlexNet; this new layer accounts for the new classes plus an extra background class which ideally captures everything apart from the objects of interest. Our approach is independent of this previous work and can easily be adapted

to state-of-the-art object detection architectures like Faster-RCNN. Our approach differs in that we force our representation to output a non-zero vector only if the input image is an object. This forces the neural network to encode information in all the neurons of the feature layer, which is not necessarily the case in networks like R-CNN (where a subset of neurons might be sufficient). Our intuition is that part of the power of a deep neural network comes from its ability to learn a distributed representation that it can combine in multiple ways.

Learning without Forgetting (LwF) was introduced in [10] as an alternative transfer learning strategy to finetuning, in which the model also continues to perform well on the original task. Apart from the obvious advantage of performing well on both old and new tasks, LwF acts as a regularizer while training for the new task and therefore prevents overfitting on it. However, in all their experiments, the authors train their models on a large-scale dataset such as ImageNet [1] or Places2 for image classification and transfer the knowledge to smaller datasets like PASCAL VOC [2] for classification: the old and new skills are the same (namely classification). They show some evidence that performing LwF on a very different task (like classifying a different kind of image) within the same skill domain significantly degrades performance on the old task [10]. In particular, training a model for classification on Places2 and performing LwF on the CUB dataset resulted in a significant degradation of performance on Places2, since these tasks are very dissimilar. An interesting related question is whether LwF works well when applied across dissimilar skills (such as classification to localization / bounding-box regression). To this end we use AlexNet pre-trained on ImageNet and train it via Learning without Forgetting.

LSDA: Large Scale Detection through Adaptation [6] is a method to train an object detection network where the training set contains images for all classes but bounding-box labels for only a subset of them. The changes we discuss for R-CNN can also be adapted for LSDA; we discuss this further in a later section.

3. Main Contributions

3.1. A New Representation for Object Detection

This is obtained using a novel loss function that forces the neural network to activate only for objects and not at all for non-objects. This translates to pushing feature vectors corresponding to non-objects to the origin of the feature space, and feature vectors corresponding to objects to the surface of a unit hypersphere.

3.2. A Comparison of Transfer Learning Strategies

We compare Learning without Forgetting [10] and finetuning as transfer learning strategies for learning a new skill, object detection, from weights learned for image classification. The idea is that Learning without Forgetting acts as a regularizer and is therefore the better transfer learning strategy on small datasets.

4. Dataset Used

While there are multiple datasets available for training object detectors, we use PASCAL VOC 2012 (detection) for training, validation and testing. PASCAL VOC (detection) consists of images containing objects from 20 categories, spanning transportation vehicles, animals (including people) and everyday objects.

Figure 1: Example from the PASCAL VOC detection dataset.
The metadata for each image consists of a list of all objects in the image and their ground-truth bounding boxes. An example from PASCAL VOC is shown in figure 1.

4.1. Region Proposals

Ground-truth regions alone are not sufficient to train a neural network, since they contain no explicit background regions. [4] provide bounding boxes for the train and test sets they use, obtained by running the images through a selective search algorithm. However, these regions do not come pre-assigned with a label; we use the ground-truth bounding-box information to infer what each selective-search box contains, and wrote a program to assign labels to these region proposals.
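As a concrete sketch of such a labeling program (the IoU rule it implements is the one described in section 4.1.1 below; the function names and the background sentinel are illustrative, not taken from the original code):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

BACKGROUND = -1  # illustrative sentinel label for background crops

def assign_label(proposal, gt_boxes, gt_labels, threshold=0.7):
    """Assign the label of the max-IoU ground-truth box if IoU > threshold,
    otherwise treat the proposal as background (rule of section 4.1.1)."""
    ious = [iou(proposal, gt) for gt in gt_boxes]
    best = int(np.argmax(ious)) if ious else None
    if best is None or ious[best] <= threshold:
        return BACKGROUND
    return gt_labels[best]
```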

4.1.1 Label Assignment For Proposed Regions

We use the Intersection over Union (IoU) metric to assign labels. For each proposal, we compute the IoU against all ground-truth bounding boxes, with a threshold of 0.7; that is, we are only interested in ground-truth boxes with IoU > 0.7. If there are multiple such ground-truth boxes, we assign the proposal the label of the box with maximum IoU. If no IoU value crosses the 0.7 threshold, we treat the proposal as background. Each image yields about 2500 bounding boxes from selective search; with this threshold, roughly 10% of them correspond to some object while the remaining 90% are background. Examples of the crops thus generated are shown in figure 2.

Figure 2: Crops from selective search produced by [4]. The top row consists of objects while the bottom row has background crops.

4.1.2 Engineering Limitations

Due to hardware limitations (disk space) we were forced to use a subset of the full dataset. We obtain about 38000 crops corresponding to objects and more than 1M background training examples; however, we discard most of the latter and keep only 100000 randomly chosen background crops for training.

5. Technical Details

Our goal is a good representation for object detection. As mentioned before, we want the neural network to produce non-zero activations only when it is fed an object; for background crops it should ideally produce no activation. Concretely, this means the L2 norm of the final feature layer should be zero for non-objects and close to 1 for objects. We achieve this by designing loss functions that force the norm of the final features to zero for non-objects.

5.1. Loss Function for Object Classification

Given that a proposed region contains an object, we train a softmax classifier with the standard cross-entropy loss on top of the feature layer. For $m$ classes and $n$ image crops, the cross-entropy loss is

$$-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} \mathbb{1}\{y_j^{(i)} = 1\}\, \log\big(\mathrm{softmax}_\theta(\phi(x^{(i)}), j)\big)$$

$$\mathrm{softmax}_\theta(\phi(x), j) = \frac{e^{\theta_j^{T}\phi(x)}}{\sum_{k=1}^{m} e^{\theta_k^{T}\phi(x)}}$$

where $\theta$ is the classifier weight matrix ($\theta_j$ the weight vector of class $j$), $x$ is an input image crop, $y$ is the one-hot class label vector, and $\phi$ is the function computed by the neural network. The R-CNN model [4] has a similar loss function for object classification; the main distinction is that our classifier has no background class, whereas the R-CNN classifier treats background as another class.

5.2. Loss Functions to Control the L2 Norm

The loss function should penalize high L2 norm values for non-objects and low L2 norm values for objects. For this we design two loss functions: the Spherical Hinge Loss and the Spherical Softmax Loss.

5.2.1 Spherical Hinge Loss

We define the L2-norm hinge loss as

$$\frac{1}{n}\sum_{i=1}^{n} \Big\{ \big(\|\phi(x^{(i)})\|_2^2 - 1\big)\,(-1)^{\mathbb{1}\{\|y^{(i)}\|_1 = 1\}} \Big\}_+$$

where $\{x\}_+ = x$ if $x > 0$, else $0$. Here $\mathbb{1}\{\|y\|_1 = 1\}$ indicates whether an image crop contains an object: if there is no object in the crop, the class vector is all zeros.

5.2.2 Spherical Softmax Loss

In this approach we train a 2-class softmax classifier on the squared norm of the last feature layer to detect background images. The equation below is simplified because there are only 2 classes and the feature is a scalar:

$$\frac{1}{n}\sum_{i=1}^{n} \Big( \mathbb{1}\{\|y^{(i)}\|_1 = 0\}\, \log\big(1 + e^{\,k\|\phi(x^{(i)})\|_2^2 + b}\big) + \mathbb{1}\{\|y^{(i)}\|_1 = 1\}\, \log\big(1 + e^{\,-k\|\phi(x^{(i)})\|_2^2 - b}\big) \Big)$$

$k$ and $b$ are two scalar parameters trained along with the network.
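Putting sections 5.2.1 and 5.2.2 into code, a minimal sketch in PyTorch (the paper's experiments were in TensorFlow; tensor names and the `is_object` encoding are illustrative):

```python
import torch
import torch.nn.functional as F

def spherical_hinge_loss(features, is_object):
    """features: (n, d) pre-final layer; is_object: (n,) float in {0, 1}.
    Pushes object embeddings toward squared norm >= 1 and penalizes
    background embeddings with squared norm > 1 (section 5.2.1)."""
    sq_norm = (features ** 2).sum(dim=1)
    sign = 1.0 - 2.0 * is_object           # +1 for background, -1 for objects
    return F.relu((sq_norm - 1.0) * sign).mean()

class SphericalSoftmaxLoss(torch.nn.Module):
    """Two-class logistic loss on s = k * ||phi(x)||^2 + b (section 5.2.2);
    k and b are the two trainable scalars of the paper."""
    def __init__(self):
        super().__init__()
        self.k = torch.nn.Parameter(torch.tensor(1.0))
        self.b = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, features, is_object):
        s = self.k * (features ** 2).sum(dim=1) + self.b
        # background term: log(1 + e^s); object term: log(1 + e^-s)
        loss = (1.0 - is_object) * F.softplus(s) + is_object * F.softplus(-s)
        return loss.mean()
```

Under this encoding, `is_object` is 1 exactly when the label vector $y$ is one-hot, matching the indicator $\mathbb{1}\{\|y\|_1 = 1\}$ above.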

Figure 3: Schematic of the three-layer convolutional neural network.

5.3. Neural Network Architectures Used

5.3.1 Three-Layer CNN

In order to quickly validate our hypothesis about loss functions on the L2 norm, we use a three-layer convolutional neural network, since it is fast and easy to train. To reduce the number of parameters, we resize all crops to 80 x 80 x 3. A schematic of this network is given in figure 3.

5.3.2 AlexNet

Once we validated the hypothesis of using the L2 norm for classification, we moved to AlexNet pretrained on ImageNet, which we use to compare the finetuning and LwF transfer learning strategies. We chose AlexNet because pretrained TensorFlow weights are available online and it is one of the simplest deep networks to analyze.

5.4. Learning without Forgetting (LwF)

The network is initially trained to classify the ImageNet dataset. To ensure that previously learned capabilities are not forgotten, we use the Learning without Forgetting (LwF) transfer learning strategy; LwF also provides good regularization while training the weights of the network. Let $\phi$ represent the original network, $\theta_{img}$ the original weights of the ImageNet classifier, and $m_{img}$ the number of ImageNet classes. We compute

$$\hat{z}^{(i)} = \mathrm{softmax}_{\theta_{img}}(\phi(x^{(i)}))$$

where $\mathrm{softmax}_{\theta_{img}}$ is the output of the softmax layer of the ImageNet classifier, an $m_{img}$-dimensional vector. We use the knowledge distillation loss to minimize the change in the output on the old task. The loss function for Learning without Forgetting is

$$-\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m_{img}} \hat{z}_j'^{(i)} \log z_j'^{(i)}$$

where

$$z_j'^{(i)} = \frac{\big(z_j^{(i)}\big)^{1/T}}{\sum_{k=1}^{m_{img}} \big(z_k^{(i)}\big)^{1/T}}, \qquad \hat{z}_j'^{(i)} = \frac{\big(\hat{z}_j^{(i)}\big)^{1/T}}{\sum_{k=1}^{m_{img}} \big(\hat{z}_k^{(i)}\big)^{1/T}}$$

and $z^{(i)} = \mathrm{softmax}_{\theta_{img}}(\phi(x^{(i)}))$ is the corresponding softmax output of the network being updated. This loss function ensures that information about the previous task is maintained, and acts as a regularization term for our object detection task.
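A minimal sketch of this distillation term in PyTorch (the paper's experiments were in TensorFlow; note that raising softmax outputs to the power $1/T$ and renormalizing, as in the equations above, is equivalent to dividing the logits by $T$ before the softmax):

```python
import torch.nn.functional as F

def lwf_distillation_loss(new_logits, anchor_logits, T=2.0):
    """Knowledge-distillation term of section 5.4.
    anchor_logits: from the frozen ImageNet-pretrained copy of the network.
    new_logits:    from the copy being trained.
    T:             distillation temperature hyper-parameter.
    Computes -1/n * sum_i sum_j z_hat'_ij * log(z'_ij)."""
    z_hat = F.softmax(anchor_logits / T, dim=1)    # temperature-scaled targets
    log_z = F.log_softmax(new_logits / T, dim=1)   # temperature-scaled predictions
    return -(z_hat * log_z).sum(dim=1).mean()
```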
6. Experiments

6.1. Experiment 1: Comparing Detection Pipelines

There are three different ways to take an image crop and perform classification during inference.

RCNN-like classification: the background is treated as just another class and the network is trained to classify crops into one of 21 categories. The schematic is shown in figure 4.

Network trained with the Spherical Hinge Loss: the norm of the final features is computed first; if it is less than 1, the crop is declared background, otherwise it is passed through a softmax classifier over the 20 object categories. The training loss is a linear combination of the two loss values, with the weights as hyper-parameters; in our experiments we use a weight of 1 for each.

Network trained with the Spherical Softmax Loss: here the norm directly determines (via the binary softmax loss) whether or not the crop is an object; if it is, it is passed through a softmax classifier over the 20 object categories. The training loss is again a linear combination of the two losses, with weight 1 for each.

The schematic for the latter two networks is depicted in figure 5. To compare the three approaches, we use a three-layer convolutional neural network as the base network, as mentioned previously, and compare classification accuracy on a fixed validation set for all three pipelines.

6.2. Experiment 2: Comparing Transfer Learning Strategies

RCNN uses the finetuning transfer learning strategy on AlexNet weights learned on ImageNet. We want to see whether the more recently introduced Learning without Forgetting (LwF) strategy works better than finetuning. From the first experiment we found that the network trained with the Spherical Hinge Loss (we refer to it as SHL) performs better than both the RCNN-like object classifier (we refer to it as RCNN(ours)) and the network trained with the Spherical Softmax Loss (we refer to it as SSL). We therefore narrow this experiment down to comparing finetuning strategies on RCNN(ours) and SHL. The base network is AlexNet pre-trained on ImageNet.
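Both strategies in this experiment update only the layers from conv4 onward (sections 6.2.1 and 6.2.2 below). A minimal sketch of that freezing step, assuming torchvision's AlexNet layout where conv4 sits at `features[8]` (the paper's code was in TensorFlow, so this is purely illustrative):

```python
import torch
from torchvision import models

# In torchvision's AlexNet, the five conv layers sit at features[0], [3],
# [6], [8] and [10]; conv4 is features[8].
model = models.alexnet(pretrained=True)
CONV4_INDEX = 8
for idx, module in enumerate(model.features):
    if idx < CONV4_INDEX:
        for p in module.parameters():
            p.requires_grad = False   # conv1-conv3 stay fixed

# Hand only the unfrozen parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
```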

Figure 4: Schematic of the network used for RCNN-like classification. The base network is a three-layer CNN.

Figure 5: Schematic of the network used for classification with the Spherical Hinge Loss and Spherical Softmax Loss on the L2 norm. The base network is a three-layer CNN.

Figure 6: Schematic of SHL with the LwF loss function added.

Figure 7: Schematic of RCNN(ours) with the LwF loss function added.

6.2.1 Finetuning

We finetune RCNN(ours) and SHL from the conv4 layer onward. The reason for this choice is that the initial layers of a neural network are known to act as simple edge detectors and Gabor-like filters, which generalize well across datasets. However, we expect this not to work very well for SHL, since it drastically alters the distribution of the data in the representation space.

6.2.2 Learning without Forgetting (LwF)

We use AlexNet trained on ImageNet as an anchor network whose weights are never updated, together with a second copy of AlexNet pre-trained on ImageNet whose weights are updated according to a loss function that includes the LwF loss. As with finetuning, we only update the weights of the base network from conv4 onward. The schematic of this network for SHL is given in figure 6, and for RCNN(ours) in figure 7. The final loss function is a linear combination of the loss functions shown in the respective figures, where the weights of the combination are hyper-parameters.

Figure 8: Histogram (log-log scale) of the squared-norm values of the pre-final layer for objects (red) and non-objects (green) with the spherical hinge loss.

7. Results

7.1. Experiment 1

From experiment 6.1, we find that SHL performs better than both RCNN(ours) and the network trained with SSL. We train the

model over 100 epochs, computing validation accuracy every 10 epochs, and report the best accuracy in the table below. We use the following abbreviations:

OCA = Object classification accuracy: classification accuracy among the 20 object categories.
CA = Overall classification accuracy across all classes, including background.
BC = Background classification: the accuracy of separating objects from non-objects.

Model                         OCA     CA      BC
RCNN-like classifier          0.133   0.764   0.793
Spherical Hinge Loss Net      0.348   0.755   0.801
Spherical Softmax Loss Net    0.241   0.653   0.716

Figure 9: Histogram (log-log scale) of the squared-norm values of the pre-final layer for objects (red) and non-objects (green) with the spherical softmax loss.

Figure 10: t-SNE diagram for objects (red) and background (blue) for RCNN(ours).

7.1.1 Discussion

We also plot histograms of the squared norm of the pre-final layer for SHL (figure 8) and SSL (figure 9). SHL yields a clean separation between the two classes, while there is much more overlap between the object and non-object classes under the Spherical Softmax Loss; this is consistent with the classification accuracies. We use t-SNE to visualize how object and background images are distributed in the embedding space in figures 10, 11 and 12. For the vanilla network the background images are distributed haphazardly, whereas for SHL and SSL they are more concentrated, and more so for SHL than for SSL.

Figure 11: t-SNE diagram for objects (red) and background (blue) for SHL.

7.2. Experiment 2

7.2.1 Comparison between SHL and RCNN(ours) under the LwF strategy

From figures 13, 14 and 15, we find that while SHL performs better than RCNN(ours) on OCA, it performs much worse on the CA and BC metrics.

7.2.2 Comparison between SHL and RCNN(ours) under the finetuning strategy

From figures 16, 17 and 18, we find that RCNN(ours) performs better than SHL on all accuracy metrics.
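The t-SNE visualizations of figures 10 to 12 can be reproduced along these lines; a minimal sketch using scikit-learn and matplotlib, where `embeddings` (an (n, d) array of pre-final-layer features) and `is_object` (an (n,) boolean mask) stand in for whatever the evaluation loop actually collects:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embedding_tsne(embeddings, is_object):
    """Project pre-final-layer features to 2D and color by object/background."""
    xy = TSNE(n_components=2).fit_transform(embeddings)
    plt.scatter(xy[~is_object, 0], xy[~is_object, 1], c="blue", s=4, label="background")
    plt.scatter(xy[is_object, 0], xy[is_object, 1], c="red", s=4, label="objects")
    plt.legend()
    plt.show()
```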

Figure 12: t-SNE diagram for objects (red) and background (blue) for the spherical softmax network.

Figure 13: Comparison of OCA for SHL and RCNN(ours) with the LwF strategy.

Figure 14: Comparison of CA for SHL and RCNN(ours) with the LwF strategy.

Figure 15: Comparison of BC for SHL and RCNN(ours) with the LwF strategy.

7.2.3 Comparison of SHL under the finetuning and LwF transfer learning strategies

From figures 19, 20 and 21, we find that SHL with LwF performs better than SHL with finetuning on OCA, and worse on the CA and BC metrics.

7.2.4 Discussion

Observations 7.2.1 and 7.2.2 are not surprising. We use a pre-trained AlexNet that was trained to perform classification; SHL imposes drastic constraints on the embeddings of the network, whereas RCNN(ours) can simply build on the weight structure produced by pre-training on ImageNet. Since we do not train all the network weights, this effect is even more pronounced.

Figure 16: Comparison of OCA for SHL and RCNN(ours) with the finetuning strategy.

Figure 17: Comparison of CA for SHL and RCNN(ours) with the finetuning strategy.

Figure 18: Comparison of BC for SHL and RCNN(ours) with the finetuning strategy.

Figure 19: Comparison of OCA for SHL with the finetuning strategy against SHL with the LwF strategy.

Figure 20: Comparison of CA for SHL with the finetuning strategy against SHL with the LwF strategy.

Figure 21: Comparison of BC for SHL with the finetuning strategy against SHL with the LwF strategy.

8. Conclusions

We base our exploration on the intuition that background should be treated differently from objects. We incorporate this through special loss functions, the Spherical Hinge Loss and the Spherical Softmax Loss, on the L2 norm of the embeddings of the base network. We perform two experiments (6.1 and 6.2) and find that when networks are trained from random initialization, using the Spherical Hinge Loss on the L2 norm is more effective than RCNN(ours), where background is treated as another class. However, when warm-starting the learning process (with either finetuning or LwF) from ImageNet pre-trained networks, RCNN(ours) performs better than SHL; the reason is discussed in 7.2.4. We believe that training SHL on a large-scale object detection dataset like ImageNet and then using transfer learning for smaller datasets like PASCAL VOC might actually perform better than the original RCNN [4].

Figure 22: Detection with the LSDA network. Given an image, extract region proposals, reshape the regions to fit the network input size, and produce per-category detection scores for each region. Layers with red dots/fill are modified/learned during fine-tuning with the available bounding-box annotated data. Learning without Forgetting can protect these layers from losing information about the classes in set A; background detection can be done with the spherical hinge loss.

9. Future Direction

9.1. Fast and Faster-RCNN

The methods we describe can also be applied to the Fast and Faster-RCNN networks, whose reference implementations are in Caffe. We had started out working in TensorFlow, and implementing Fast and Faster-RCNN requires the Region of Interest Pooling layer (introduced in Fast-RCNN), which was not available in TensorFlow at the time; the open-source implementations we found did not work well. We therefore ran our experiments on the R-CNN network instead. It is important to note that the improvements made by the Fast and Faster-RCNN networks are orthogonal in purpose to our modifications, so the two can be combined into one object detection system.

9.2. LSDA

The LSDA: Large Scale Detection through Adaptation network [6] solves a more general problem: the training set contains classification data for all classes, but bounding-box data for only a subset of them. Their method trains a network which solves the object detection problem for the whole dataset. We describe their approach here. The set of classes is split in two according to whether bounding-box labels are available: say set A does not have them and set B does. They start with a network trained for classification on the whole dataset (A ∪ B). Unlike a usual classification network, they do not use normalized softmax values, but linear scores which can lie anywhere over the reals. After training, the final layer provides detection scores for the whole dataset; f_A denotes the cells which provide scores for classes in set A, and similarly f_B. Next, they initialize new cells δ_B in the last layer to encode object detection information for the background and for the classes with bounding-box data. The object detection score for classes in B is computed by adding the classification scores and the scores from the new cells, f_B + δ_B. For classes in set A, they find the nearest neighbors (according to the weights in the last layer) in set B and average their δ_B scores, approximating the δ_A cells that would have been learned had the data been available.

The additions we consider for RCNN, the Spherical Hinge Loss and LwF, can be used on the LSDA network as well. During training, all the earlier layers are finetuned on data from set B; the LwF loss would act as a regularizer and prevent knowledge about set A from being lost. Similarly, for background classification, the Spherical Hinge Loss can be used instead of a background class.
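A sketch of the score composition just described, purely for illustration (dict-based pseudocode, not LSDA's actual implementation):

```python
import numpy as np

def lsda_scores(f_scores, delta_b, nn_map):
    """Compose LSDA detection scores as described above.
    f_scores: dict class -> classification score (sets A and B together).
    delta_b:  dict class -> learned detection offset, only for classes in B.
    nn_map:   dict class in A -> list of its nearest-neighbor classes in B
              (nearest in last-layer weight space).
    Returns a dict of detection scores for every class."""
    det = {}
    for c, f in f_scores.items():
        if c in delta_b:                    # class in B: f_B + delta_B
            det[c] = f + delta_b[c]
        else:                               # class in A: approximate delta_A
            det[c] = f + np.mean([delta_b[n] for n in nn_map[c]])
    return det
```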

References

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) results. http://www.pascalnetwork.org/challenges/voc/voc2012/workshop/index.html.

[3] R. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.

[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014.

[5] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR, abs/1406.4729, 2014.

[6] J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems 27, pages 3536-3544. Curran Associates, Inc., 2014.

[7] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.

[8] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, pages 91-99. Curran Associates, Inc., 2015.

[9] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems 26, pages 2553-2561. Curran Associates, Inc., 2013.

[10] Z. Li and D. Hoiem. Learning without forgetting. arXiv preprint arXiv:1606.09282, 2016.