REAL-TIME OBJECT DETECTION WITH CONVOLUTIONAL NEURAL NETWORK USING KERAS

Asmita Goswami [1], Lokesh Soni [2]
Department of Information Technology [1], Jaipur Engineering College and Research Center, Jaipur [2]
{asmita.jaipur@gmail.com, sonilokesh24@gmail.com}

ABSTRACT

This paper presents how classification, localization and detection of objects can be achieved with convolutional networks. We applied increasingly complex neural networks to simple images, after which detection of multiple objects of varying shape and colour was enabled. We also predict expected object boundaries by regressing to them with a mean squared error (MSE) loss, a deep learning approach to localization by learning. To increase detection confidence, bounding boxes are accumulated rather than suppressed. We used Adadelta as the optimizer, which is essentially standard stochastic gradient descent with an adaptive learning rate. We show that a single shared network can learn several tasks simultaneously.

1) Introduction

In the past years, there has been tremendous progress in the field of machine learning for addressing difficult visual recognition problems. Deep convolutional neural networks[1, 2] achieve reasonable performance on hard visual recognition tasks, matching or exceeding human performance in certain domains. Images are easy to generate and handle, and they are easy for human beings to understand, but difficult for computers. Image analysis has always played a key role in the history of deep neural networks.
Convolutional neural networks[12] are the state-of-the-art technique for identifying objects in images, that is, image recognition. Until the emergence of convolutional neural networks, it was difficult to implement with machine algorithms the kind of object recognition that comes naturally to humans. With improved CNN designs, larger annotated data sets, the availability of cheap computing power, and enhanced techniques such as inception modules and skip connections, notably deeper models with more layers have become feasible and now challenge human accuracy in object identification. In terms of detection speed, however, even the best algorithms still suffer from heavy computational cost.

2) Detection Approach

a) Single Object Detection

The neural network is a very simple feedforward network with one hidden layer (no convolutions); its key element is the structure of the information processing system. Given the flattened image as input (i.e. 8 x 8 = 64 values), it predicts the parameters of the bounding box (i.e. the coordinates x and y of the lower-left corner, the width w and the height h). During training, the predicted bounding boxes are simply regressed to the expected bounding boxes via mean squared error (MSE). Adadelta is used as the optimizer; it is essentially standard stochastic gradient descent with an adaptive learning rate, which saves a lot of time otherwise spent on hyperparameter optimization. Here is how the network is implemented in Keras:

    model = Sequential([
        Dense(200, input_dim=64),
        Activation('relu'),
        Dropout(0.2),
        Dense(4)
    ])
    model.compile('adadelta', 'mse')

The network was trained with 40k random images for 50 epochs (about one minute on a laptop CPU) and achieved almost perfect results. The predicted bounding boxes on the images above (which were held out during training) are as follows:

Figure 2: Predicted Bounding Boxes [2]

The index plotted above each bounding box is the Intersection over Union (IOU), which measures the overlap between the predicted and the real bounding box. It is calculated by dividing the area of the intersection (pink in the image below) by the area of the union (blue in the image below). The IOU ranges from 0 (no overlap) to 1 (perfect overlap).

Figure 3: Intersection over union [3]

b) Multiple Objects Detection

Applied naively, the single-object method predicts duplicate boxes in the centre of the image, so during training each predicted bounding box is instead assigned to one rectangle. The predictors then learn to specialize in certain locations or shapes of rectangles. To achieve this, process the target vectors after every
epoch as follows. For each training image, calculate the mean squared error (MSE) between the prediction and the target A) for the current order of the bounding boxes in the target vector (i.e. x1, y1, w1, h1, x2, y2, w2, h2) and B) for the flipped order (i.e. x2, y2, w2, h2, x1, y1, w1, h1). If the MSE of A is smaller than the MSE of B, leave the target vector as is; otherwise, flip it. The algorithm is given in shape-detection[3]. The flipping process is visualized below:

Figure 4: Flipping Process [4]

In the plot above, each row is a sample from the training set, and the epochs of the training process run from left to right. Black indicates that the target vector was flipped after that epoch; white corresponds to no flip. Most flips occur at the beginning of training, when the predictors have not yet specialized. If the network is trained with flipping enabled, the following results are obtained (again on held-out test images):

Figure 5: Predicted bounding boxes after flipping [5]

The network achieves a mean IOU of 0.5 on the training data.

c) Classifying Objects

To go one step further, triangles are added and the network must classify whether each object is a rectangle or a triangle. The same network as above is used; only one value per bounding box is added to the target vector: 0 if the object is a rectangle and 1 if it is a triangle (i.e. binary classification). Here are the results:

Figure 6: Classification of objects [6]

A red box means the predicted class is rectangle; a yellow box means it is triangle.

3) Experiments

In this section, we benchmark our method, putting shapes, colours, and convolutional neural networks together. To generate the images, we used the pycairo library[4], which can write RGB images with simple shapes to numpy arrays. We also made some modifications to the network itself, but let us first look at the results:

Figure 7: Experimental Results [7]
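The IOU values reported throughout can be computed directly from two boxes in the (x, y, w, h) format used above. The following is a minimal sketch; the helper name `iou` is ours, not part of the paper's code:

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x, y, w, h),
    where (x, y) is the lower-left corner."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # Overlap along each axis (zero if the boxes do not intersect).
    inter_w = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
    inter_h = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0
```

For identical boxes the score is 1, for disjoint boxes it is 0, consistent with the range described in section 2a.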
The mean IOU on the test dataset is around 0.4, which is not bad for recognizing three objects at once. The predicted shapes and colours (written above the bounding boxes) are almost perfect (test accuracy of 95%). The network has thus genuinely learned to assign the predictors to different objects, as we aimed for with the flipping trick. In comparison to the simple experiments above, we made three modifications:

1) We used a convolutional neural network (CNN)[4] instead of a feedforward network. CNNs scan the image with learnable filters and extract increasingly abstract features at each layer. Filters in early layers may, for example, detect edges or colour gradients, while later layers may register complex shapes[5]. For the results above, a network with four convolutional and two pooling layers was trained for about 30-40 minutes. Better results could probably be achieved with a deeper, more optimized, or longer-trained network.

2) We did not use a single (binary) value for classification, but one-hot vectors (0 everywhere, 1 at the index of the class). Specifically, we used one vector per object to classify shape (rectangle, triangle or circle) and another to classify colour (red, green or blue). Note that we added some random variation to the colours in the input images to see whether the network can handle it. All in all, the target vector for an image consists of 10 values per object (4 for the bounding box, 3 for the shape classification, and 3 for the colour classification).

3) We adapted the flipping algorithm to work with multiple bounding boxes. After each epoch, the algorithm calculates the mean squared error for all combinations of one predicted and one expected bounding box.
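The matching step of modification 3 can be sketched as a greedy minimum-MSE assignment. This is a minimal NumPy illustration; the function name and the (n_boxes, 4) array layout are our assumptions, not the paper's code:

```python
import numpy as np

def assign_boxes(pred, target):
    """Greedily match each predicted box to an expected box:
    repeatedly take the pair with the smallest MSE among all
    still-unassigned pairs. Both inputs have shape (n_boxes, 4)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(pred)
    # MSE between every predicted and every expected box.
    mse = np.mean((pred[:, None, :] - target[None, :, :]) ** 2, axis=2)
    assignment = {}
    free_pred, free_tgt = set(range(n)), set(range(n))
    while free_pred:
        # Smallest MSE among pairs where both boxes are unassigned.
        i, j = min(((i, j) for i in free_pred for j in free_tgt),
                   key=lambda ij: mse[ij])
        assignment[i] = j
        free_pred.remove(i)
        free_tgt.remove(j)
    return assignment
```

The greedy choice is simple and cheap; an optimal matching could instead be obtained with the Hungarian algorithm, but the paper describes the greedy variant.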
It then takes the minimum among those values, assigns the corresponding predicted and expected bounding boxes to each other, takes the next-smallest value among the boxes that are not yet assigned, and so on.

4) Conclusions

Real-time object detection with convolutional neural networks is an important basic research problem in computer vision that requires a system to do much more than task-specific algorithms, such as object recognition and object detection algorithms, can do alone. The combination of recent technical innovations in deep learning makes it possible to re-design the feature extraction part of the Faster R-CNN framework to maximize computational efficiency. The algorithm we have presented can, however, only predict a fixed number of bounding boxes per image.

5) References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[2] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
[3] https://github.com/jrieke/shape-detection/blob/master/two-rectangles-or-triangles.ipynb
[4] https://cairographics.org/pycairo/
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
[6] Stanford's CS231n; Michael Nielsen's book; Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
[7] I. Misra, R. Girshick, R. Fergus, M. Hebert, A. Gupta, and L. van der Maaten. Learning by Asking Questions. arXiv preprint, 2017.
[8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. arXiv:1506.02640, 2015.
[9] K.-H. Kim, S. Hong, B. Roh, Y. Cheon, and M. Park. Deep but Lightweight Neural Networks for Real-time Object Detection. arXiv:1608.08021v3, 2016.
[10] M. Malinowski, M. Rohrbach, and M. Fritz. A Neural-based Approach to Answering Questions about Images.
[11] W. Koehrsen. Object Recognition with Google's Convolutional Neural Networks.
[12] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing Neural Network Architectures Using Reinforcement Learning. arXiv:1611.02167v2, 2016.