Vision based indoor object detection for a drone


DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ENGINEERING PHYSICS AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Vision based indoor object detection for a drone

LINNEA GRIP

KTH SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Vision based indoor object detection for a drone

LINNEA GRIP

Master in Computer Science
Date: May 31, 2017
Supervisor: Patric Jensfelt
Examiner: Hedvig Kjellström
Swedish title: Bildbaserad detektion av inomhusobjekt för drönare
School of Computer Science and Communication

Abstract

Drones are a very active area of research and object detection is a crucial part in achieving full autonomy of any robot. We investigated how state-of-the-art object detection algorithms perform on image data from a drone. For the evaluation we collected a number of datasets in an indoor office environment with different cameras and camera placements. We surveyed the object detection literature and selected the algorithm R-FCN (Region-based Fully Convolutional Network) for the evaluation. The performances on the different datasets were then compared, showing that using footage from a drone may be advantageous in scenarios where the goal is to detect as many objects as possible. Further, it was shown that the network, even if trained on normal-angle images, can be used for detecting objects in fish eye images, and that usage of a fish eye camera can increase the total number of detected objects in a scene.

Sammanfattning

Drones are a very active research area and object detection is an important part of achieving full autonomy for robots. We investigated how today's best object detection algorithms perform on image data from a drone. We conducted a literature study and chose to investigate the algorithm R-FCN (Region-based Fully Convolutional Network). To evaluate the algorithm, several datasets were recorded in an office environment with different cameras and camera placements. The performance on the different datasets was then compared, and it was shown that using images from a drone can be advantageous when the goal is to find as many objects as possible. Further, it was shown that the network, even if trained on images from a normal camera, can be used to find objects in wide-angle images, and that using a wide-angle camera can increase the total number of detected objects in a scene.

Contents

1 Introduction
   1.1 Research Question and Hypotheses
   1.2 Limitations
   1.3 Report Outline
2 Background
   2.1 Convolutional Neural Networks
   2.2 Other object detection methods
   2.3 Common Datasets
   2.4 Metrics
3 Related work
   3.1 Drones
   3.2 Object detection
4 The Object Detection Algorithm
5 Method
   5.1 Experiment Design
   5.2 Evaluation
6 Experiments
   6.1 Fish Eye Camera (Experimental Setup, Results, Analysis)
   6.2 Distance to Objects (Experimental Setup, Results, Analysis)
   6.3 Camera Angle (Experimental Setup, Results, Analysis)
7 Real Drone
8 Summary and Discussion
   8.1 Conclusions
   8.2 Error Sources
   8.3 Connection to Other Research
   8.4 Future Work
   8.5 Summary
Bibliography
A Social Aspects
   A.1 Sustainability
   A.2 Ethics
   A.3 Society

Chapter 1 Introduction

Object detection is important for reaching higher level autonomy for robots. It is a very active area of research in robotics, applied computer vision and machine learning. Unmanned Aerial Vehicles (UAVs), or drones, are increasingly being used as robotic platforms. It is of interest to see how methods that have been developed in computer vision and machine learning, and used on other robot embodiments, can be applied to drones.

The objective of this degree project is to determine how an existing object detection method can be used on image data from a drone. It will examine whether the flexibility of a drone, compared to that of a traditional ground based robot, can be used to improve the performance of object detection, assuming that a drone is indeed more flexible. For example, as a drone can move closer to objects than a wheeled robot can, it may be possible to detect more small objects in data from a drone. Here, objects that can easily be held in one hand, such as cups, cell phones and bottles, are counted as small.

When a drone navigates a building in search of objects, it is of interest for the drone to be able to view as much of its surroundings as possible. To achieve a large field of view the camera could be mounted on a tilting mechanism on the drone. This would require putting more weight on the drone; to avoid this, a wide angle (fish eye) camera is used instead. However, images taken by a fish eye camera are distorted and quite different from images taken by a normal camera. Therefore, it cannot be assumed that object detection algorithms normally used on "normal" images perform well on fish eye images. Part of the study is therefore to investigate how algorithms widely used on normal images perform on fish eye images.

Previous works ([1], [2]) stress that the images captured by a drone are often different from those available for training, which are often taken by a hand held camera. Difficulties in detecting objects in data from a drone may arise due to the positioning of the camera compared to in images taken by a human, depending on what type of images the network is trained on. Therefore, different ways of positioning the drone and the camera with respect to objects will be evaluated.

1.1 Research Question and Hypotheses

Can the performance of an object detection algorithm in an indoor scene be maintained and/or improved using the flexibility of a drone as compared to a ground based robot and, if so, how?

After a literature study, the algorithm currently best suited for indoor detection of objects is chosen. The chosen algorithm is then evaluated on different datasets in order to determine whether there are any benefits and/or drawbacks in using data acquired by a drone instead of by a wheeled robot when trying to detect objects in an indoor scene, what type of camera to use and how to take advantage of the flexibility of the drone. Several hypotheses will be addressed, including the following.

1. The chosen algorithm can be used on image data acquired by a drone.
2. The chosen algorithm, trained on images from a normal camera, can be used to some extent on images from a fish eye camera.
3. More objects can be detected in data from a fish eye camera than from a normal camera, because of the larger field of view.
4. More objects can be detected from a closer viewpoint.

1.2 Limitations

It will be assumed that a drone equipped with an RGB camera sends a continuous stream of images to a computer, which then performs computations off board the drone. It is not part of the project to perform "light weight" object detection on board the drone. The drone will navigate (not part of the project) an indoor, office-like environment and encounter and try to detect objects. It is expected to be able to detect objects such as chairs, screens and people. However, detection of smaller objects such as mugs and cell phones will also be attempted.

1.3 Report Outline

In Chapter 2 relevant theory of object detection is outlined. Chapter 3 touches on important works that have been made in the areas of object detection as well as drones. Chapter 4 briefly describes the algorithm used for detecting objects throughout the project. The general method used for performing experiments and evaluating the performance of the object detection algorithm on different datasets is described in Chapter 5. The three sections of Chapter 6 each present an experiment. They contain first a brief motivation of why the experiment was important, then a description of the experimental setup, the results obtained and lastly a short analysis of the results. These three experiments were carried out without using a real drone. Chapter 7 then shows data acquired by a camera mounted on a real, flying drone, and the detections as predicted by the algorithm.

The results are discussed in Chapter 8, which also proposes future work. In particular, Section 8.5 contains a summary of the report and Appendix A presents a brief discussion about social aspects of using drones and object detection.

Chapter 2 Background

In this chapter, important theory and concepts that may not be common knowledge are explained. Object detection entails detecting instances of predefined object classes in images. The object instances should also be localized using a so-called bounding box, a box containing the object in the image. There are many ways of performing object detection, each method with different strengths and weaknesses.

2.1 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are special types of Neural Networks that are especially well designed for use on images. This allows for optimizing the architecture so that the number of parameters of the network can be reduced, compared to a regular Neural Network, and the method made more efficient [3]. A CNN generally consists of two main parts: convolutional layers followed by fully connected layers. CNNs are trained end-to-end, that is, from pixels to final classification without needing to introduce any particular feature extractor, which makes CNNs a good choice for various general object detection tasks. However, training a CNN requires very large sets of images compared to other object detection methods [4]. There are several CNN based methods available and state-of-the-art object detection of today builds on CNNs, as will be described in Chapter 3.

Convolutional layers

The convolutional layers of a CNN perform a sliding window operation and output feature maps. Each convolutional layer of a CNN represents a certain type of feature and each corresponding output feature map is a spatial activation image where the strongest responses to the feature of interest are indicated. For example, a convolutional layer applied on an image of a box could output a feature map showing strong activations on the positions of the corners of the box. In Figure 2.1, the depth of the dotted box represents the number of these feature layers.

In the learning process the weights of the convolutional layers are tuned so that features that reduce the prediction error more are given larger weight than less helpful features. The convolutional layers do not require any specific image size.

Figure 2.1: Schematic figure of a CNN. Figure copied from [3].

Fully-connected layers

Fully-connected layers are often added on top of the convolutional layers to perform the actual classification, since the output feature maps of the convolutional layers are still low level. The fully-connected layers are built in the same way as regular Neural Networks and have full connectivity, since all neurons of each layer are connected to all outputs from the previous layer. Often, the fully-connected part consists of regular Neural Network layers or other classifiers, such as support vector machines (e.g. [5], [6]). These layers take the feature maps as input and classify objects in the image depending on the activated features in the feature maps. The fully-connected layers require fixed size input vectors - a property that used to cause problems when images were of differing sizes (e.g. [5], [7]) but has now been addressed (e.g. [6], [8]), as will be mentioned in Chapter 3.

Region Proposals

Many object detection methods of today rely on some type of region proposal algorithm, which can be integrated with (e.g. [8], [9]) or separated from (e.g. [5], [7], [6]) the CNN itself. The region proposal algorithms suggest regions, or bounding boxes, in the images likely to contain objects, so that the rest of the computations (classification and finer localization) can be made only in these probable regions.

2.2 Other object detection methods

There are several ways, apart from CNNs, to perform object detection. The different methods have different strengths and weaknesses, such as different computation time, accuracy or performance on different types of objects. For example, methods based on HOG [10] or SIFT [11] may be more suitable for on-board classification (for example on a drone) because they require less memory and work on a CPU. However, as of today CNNs are the primary approach to most object detection problems [12], with outstanding performance.
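To make the two-part structure described in Section 2.1 concrete, the following is a minimal PyTorch sketch of a CNN with a convolutional part producing feature maps and a fully-connected part doing the classification. The layer sizes, input resolution and class count are illustrative assumptions, not the network used in this project.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Illustrative CNN: convolutional layers produce feature maps,
    fully-connected layers classify. Sizes are arbitrary examples."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(           # convolutional part
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # The fully-connected part expects a fixed-size input vector,
        # which is why plain CNN classifiers need a fixed image size.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                         # x: (batch, 3, 224, 224)
        return self.classifier(self.features(x))
```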

2.3 Common Datasets

There are several widely used datasets in the object detection community to train and evaluate the performance of different methods and networks on standard images. Some of the largest are the 20 category PASCAL Visual Object Classes (VOC) challenge [13], ImageNet [14], with millions of classified images and at least one million images with corresponding bounding boxes, and Microsoft COCO [15], which presents a dataset of more than 300,000 images and 80 labeled categories, including smaller objects such as fruit, cell phones and computer mice in natural, everyday scenes.

2.4 Metrics

There are some common metrics for measuring the performance of object detection.

Intersection over Union

Introduced in [13], the Intersection over Union (IoU) is a metric commonly used in object detection for evaluating the correctness of a bounding box. IoU is computed by

    IoU = intersection area / union area    (2.1)

where the intersection area is the area of the intersection between the predicted bounding box and the true bounding box (their overlap). Similarly, the union area is the area of the union of the two. A predicted bounding box close to the true bounding box yields an IoU close to 1.

Precision and Recall

The precision of a classifier on a dataset is defined as the number of true positives over the total number of detected positives, that is

    Precision = number of true positives / (number of true positives + number of false positives)    (2.2)

Here, a true positive is a detection of an instance that is actually present in the image. A false positive is a detection of an instance that is not present in the image. That is, the number of true positives is the number of objects correctly classified as a certain class and the number of false positives is the number of objects incorrectly classified as that class. When no false positives are detected the precision is 1, regardless of whether there are any true positives. A precision of 1 means that all detected objects were true, but doesn't say anything about how many actually existing objects were not detected.

In the same way, the recall of a classifier on a dataset is defined as the number of true positives over the true number of instances, that is

    Recall = number of true positives / (number of true positives + number of false negatives)    (2.3)

Here, a false negative is an instance of an object that is present in the image but not detected. When there are no false negatives the recall is 1, regardless of whether there are any true positives or not. A recall of 1 only means that no objects that should have been detected were left out, and doesn't say anything about the quality of the actual predictions made. It is desirable to maximize both precision and recall, so that few instances are wrongly classified while at the same time few instances that should have been classified are left out.

F1 score

The F1 score is a way to summarize precision and recall in one number to evaluate the overall performance of a classifier. The F1 score is defined as

    F1 = 2 · precision · recall / (precision + recall)    (2.4)

Mean Average Precision

Average precision is related to the area under a precision-recall curve for a category, that is, precision plotted against recall. It is desirable for this area to be large in order for precision and recall to be maximized. The mean Average Precision (mAP) is the Average Precision averaged over all class categories in a dataset and is a common way of evaluating how well an object detection method performs.
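As a concrete illustration of the metrics above, the following is a minimal Python sketch that computes IoU for two axis-aligned boxes and precision, recall and F1 from detection counts. The box format and function names are illustrative assumptions, not code from the project.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f1(num_tp, num_fp, num_fn):
    """Precision, recall and F1 from counts of true positives,
    false positives and false negatives, as defined in Section 2.4."""
    precision = num_tp / (num_tp + num_fp) if (num_tp + num_fp) > 0 else 1.0
    recall = num_tp / (num_tp + num_fn) if (num_tp + num_fn) > 0 else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1
```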

Chapter 3 Related work

In this chapter, previous work related to the project is briefly described. First, research related to drones and how computer vision has been used on drones is surveyed. Secondly, research in the area of object detection is described, followed by a short description of fine tuning of a CNN.

3.1 Drones

Drones are platforms capable of flying, e.g. small unmanned helicopters. A drone, as other robots, can be programmed to different levels of autonomy, from being radio controlled to being fully autonomous. To achieve full autonomy a well developed navigation and perception system is required. Drones are very flexible compared to ground based robots, as they can fly over and around things and thus view objects from a larger variety of angles. However, there is a limitation on how much weight one can put on a drone, which in turn limits the number of sensors, the on board computational power and so on. On the other hand, data can be streamed to a larger computer and processed there.

In 2014, imagery from a drone was used to count animals in images of natural environments [1]. They used imagery taken from high altitude with a skewed angle compared to "human" photos, which are usually taken from an altitude of about 1-2 meters from the front. Since their goal was to perform object detection on board the drone, GPU-requiring CNN methods were not applicable at the time and a HOG [10] based method was used. [1] stresses that most object detection algorithms are trained and tested on images taken from a "human" perspective, that is, from a certain height and angle, and can thus not be assumed to perform well on other types of images.

Drones have also been used for tracking objects on the ground, as in [16] where color thresholding was used to detect a colored rectangle to follow. In this case, no classification of the object was made. Further, [2] used an RGB camera together with a thermal camera to detect humans from on board a drone. They first found human-temperature silhouettes and then used a cascade of boosted classifiers with Haar-like features on the RGB image of the corresponding position to ensure the presence of a human.

Also here it is stressed that the images of interest are very different from images generally used in computer vision (with a "human" perspective), since they are taken from a large height and thus a skewed angle.

3.2 Object detection

Already in 1989 the first deep learning approach to object detection was proposed in [17], where supervised back-propagation networks were used to detect hand written digits in zip codes. However, until the year 2012 methods based on feature extraction such as SIFT [11] and HOG [10] were in focus, and performance on the PASCAL VOC challenge improved slowly. In 2012, [18] reintroduced the usage of Convolutional Neural Networks in object detection and won the ImageNet Large-Scale Visual Recognition Challenge [14] with their network called AlexNet. This was the starting point for a lot more research on CNNs in object detection.

[5] combined AlexNet with region proposals in 2013 (they used Selective Search [19]) and thus improved performance on PASCAL VOC significantly (from the previous best result of 35.1% mAP [19] to 53.7% mAP). The method was named R-CNN (Regions with CNN features) since it is based on first generating region proposals for the input image and then extracting a feature vector for each proposed region using a CNN. Lastly, each region is classified using a Support Vector Machine (SVM). In 2014, [12] used the features extracted by a CNN called OverFeat [4] in various recognition tasks such as image classification and scene recognition. They achieved astounding results compared to the then state-of-the-art methods in all tasks on various datasets, including PASCAL VOC [13], and thus showed that deep learning with CNNs should be considered the primary approach in any visual recognition task.

Spatial Pyramid Pooling networks (SPPnets [6]) took on the problem of earlier CNNs requiring fixed size input images in 2015, by adding an SPP layer between the last convolutional layer and the first fully-connected layer. In this way, the need to crop or warp images in order to run them through a CNN was eliminated. Further, SPPnets speed up R-CNN by sharing computation across regions. That is, in SPPnets the features of an image are computed only once instead of separately for each region of interest. SPPnets proved to be many times faster than R-CNN and to perform better or comparably [6]. Also in 2015, Fast R-CNN [7] improved the work of R-CNN [5] further by proposing a network that can simultaneously be trained to classify objects and to tune their spatial locations - leading to a significant increase in training speed (9x faster than R-CNN [5] and 3x faster than SPPnet [6]) while also achieving better accuracy on PASCAL VOC (66% mAP).

ResNet [20] introduced a deep residual learning framework at the end of 2015, which allowed networks to grow much deeper than before. They reformulated the network layers as learning residual functions with reference to the inputs instead of learning unreferenced functions, and showed that the residual mappings are easier to optimize than the original mappings.

In 2016, the R-CNN line of work was developed even further by integrating Fast R-CNN [7] with a Region Proposal Network (RPN), resulting in Faster R-CNN [9]. Until [9], the main bottleneck in object detection was the region proposals, which were often time consuming. The RPNs of [9] share convolutional layers with the object detection networks of [7], [6] and simultaneously regress region bounds and the probability that the region contains an object, at each location on a grid of the image. Usage of RPNs ensures nearly cost free region proposals and also improves the accuracy of the proposed regions.

Later in 2016, [8] proposed a Region-based Fully Convolutional Network (R-FCN), which improved object detection performance by further centralizing the method. While the previous methods had, to different extents, performed some computations several times for different regions of the images, R-FCN is fully convolutional with almost all computations shared across the whole image. To this day, R-FCN is considered a state-of-the-art method for object detection and therefore the work of this project is based on R-FCN.

Chapter 4 The Object Detection Algorithm

According to the findings of the previous chapter, R-FCN [8], being one of the best object detection frameworks of today with competitive accuracy and fast computations, is used in this project. The details of R-FCN can be found in the paper [8], but a brief overview of the architecture used is given here.

Figure 4.1: Key idea of R-FCN for object detection. Figure copied from [8].

Figure 4.1 shows the overall architecture of R-FCN. The first "white box" consists of a backbone network, in this case ResNet-101 [20]. ResNet-101 is a residual network with 100 convolutional layers followed by a pooling layer and a fully connected classification layer. Here, the two last layers are removed and the 100 convolutional layers are used to compute feature maps. From these feature maps, k × k × (C + 1) position-sensitive score maps are computed (the last "plate" in Figure 4.1). Here C is the number of object categories (+1 for background) and k is the dimension of the grid of position-sensitive score maps (3 × 3 in the figure).

These score maps are activated at a specific position relative to a certain object, for example top-left or bottom-right. For each object category there are k² score maps. An example showing how the position-sensitive score maps work is shown in Figure 4.2.

Figure 4.2: Illustration of the position-sensitive score maps of R-FCN, with k = 3. The figure is copied from [8].

Simultaneously, Regions of Interest (RoIs) are extracted using the Region Proposal Network (RPN) of [9] and the same output feature maps. A pooling layer then generates a (C + 1)-channel score map for each RoI, using the information from the position-sensitive score maps. Finally, the categories and bounding boxes are computed using a softmax function [21] and a box regression convolutional layer, respectively.

The network used in this project is pre-trained on an 80 class dataset from Microsoft COCO [15]. Several of the classes present in the dataset are "small", as defined in Chapter 1.
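The following is a minimal NumPy sketch of the position-sensitive RoI pooling step described above, written for illustration only: the channel layout, coordinate convention and average voting are assumptions based on the R-FCN paper [8], not the implementation used in the project.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k, num_classes):
    """Pool k*k*(C+1) position-sensitive score maps into C+1 class scores for one RoI.

    score_maps: array of shape (k*k*(C+1), H, W) on the feature map grid.
    roi: (x1, y1, x2, y2) in feature map coordinates.
    """
    c1 = num_classes + 1
    x1, y1, x2, y2 = roi
    bin_w, bin_h = (x2 - x1) / k, (y2 - y1) / k
    scores = np.zeros(c1)
    for i in range(k):            # vertical bin index
        for j in range(k):        # horizontal bin index
            ys = int(np.floor(y1 + i * bin_h))
            ye = max(int(np.ceil(y1 + (i + 1) * bin_h)), ys + 1)
            xs = int(np.floor(x1 + j * bin_w))
            xe = max(int(np.ceil(x1 + (j + 1) * bin_w)), xs + 1)
            # the C+1 channels dedicated to relative position (i, j)
            start = (i * k + j) * c1
            bin_maps = score_maps[start:start + c1, ys:ye, xs:xe]
            scores += bin_maps.mean(axis=(1, 2))   # average pool within the bin
    return scores / (k * k)                        # vote (average) over the k*k bins

def softmax(scores):
    """Turn the pooled class scores into class probabilities."""
    e = np.exp(scores - scores.max())
    return e / e.sum()
```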

Chapter 5 Method

The hypotheses stated in Section 1.1 are addressed in three different experiments. In this chapter, the general method of the experiments is described. Each experiment is described in more detail in Chapter 6.

5.1 Experiment Design

In all of the experiments a hand held camera is used instead of a camera mounted on a flying drone. Not using a real drone facilitates the experiments greatly, since controlling a drone is difficult. Further, images obtained by hand are assumed to be very similar to corresponding images that would have been obtained using a drone. In Chapter 7, detections made on images recorded from a real drone are displayed to support this assumption.

The procedure of each experiment includes the following steps:

1. Record various image sequences and extract a number of images (between 20 and 25), equally spaced in time.
2. Manually annotate the images with ground truth bounding boxes.
3. Input the images to R-FCN and save the resulting bounding boxes.
4. Compare the bounding boxes from R-FCN with the ground truth bounding boxes. The evaluation method is described in Section 5.2.

In step 1, between 20 and 25 images are extracted from the image sequences. In each of these images several object instances are generally present, so that the total number of objects in each dataset is larger than the number of images.

Two of the experiments are designed to directly address some of the hypotheses stated in Section 1.1. In one experiment, the numbers of detected objects in three different datasets, one recorded with a normal camera, one recorded with a fish eye camera and one recorded with a fish eye camera and then rectified, are compared in order to determine with which type of camera most objects can be detected (hypotheses 2 and 3 in Section 1.1).

In another experiment, the numbers of detected objects in four different datasets, recorded from different horizontal and vertical distances to a table with objects on it, are compared in order to determine from what distance most objects can be detected (hypothesis 4 in Section 1.1). The third and last experiment compares the numbers of detected objects in three datasets recorded with different camera angles, in order to determine how to mount the camera on the drone.

5.2 Evaluation

The experiments of Chapter 6 each contain at least two different datasets. Performances of R-FCN on the different datasets are compared, rather than defining a threshold for a "good" or "bad" performance. That is, since each experiment is designed to show in what way most objects can be detected, it is of more interest to see on which one of the datasets R-FCN performs better than to state whether it performs well on each individual dataset.

To evaluate performance, precisions, recalls and F1 scores are computed both for individual class categories and as an average over all categories present in a dataset. The procedure for computing these values can be reviewed in Chapter 2. Further, since the goal is to detect as many objects as possible, as mentioned in Section 1.1, the total number of correctly detected objects as well as the total number of objects actually present in the images are counted for each dataset.

In computing precision and recall, what is considered a correct classification, or a true positive, needs to be defined. Here an IoU threshold of IoU > 0.5 for a true positive is used, as shown in Figure 5.1. This is the standard IoU threshold of PASCAL VOC [13], and is also used in for example [8] and [9].

Figure 5.1: Illustration of the IoU requirement for a true positive: (a) IoU > 0.5, positive; (b) IoU < 0.5, negative.
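A minimal sketch of how detections could be matched to ground truth boxes under the IoU > 0.5 rule, for one image and one class, is given below. The greedy matching strategy and function names are assumptions for illustration, not necessarily the exact procedure used in the project; the iou helper is the one sketched in Section 2.4.

```python
def match_detections(gt_boxes, det_boxes, iou_fn, iou_threshold=0.5):
    """Greedily match detections of one class to ground truth boxes.

    gt_boxes, det_boxes: lists of (x1, y1, x2, y2) boxes for a single image
    and a single class. iou_fn computes the IoU of two boxes.
    Returns (num_tp, num_fp, num_fn) for use in precision/recall/F1.
    """
    unmatched_gt = list(gt_boxes)
    num_tp = 0
    for det in det_boxes:
        # find the best-overlapping, still unmatched ground truth box
        best = max(unmatched_gt, key=lambda gt: iou_fn(det, gt), default=None)
        if best is not None and iou_fn(det, best) > iou_threshold:
            num_tp += 1
            unmatched_gt.remove(best)
    num_fp = len(det_boxes) - num_tp   # detections with no matching ground truth
    num_fn = len(unmatched_gt)         # ground truth objects that were missed
    return num_tp, num_fp, num_fn
```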

Chapter 6 Experiments

This chapter contains three sections, which each describe one experiment. They start with a short motivation of why the experiment was performed, followed by a description of the experimental setup, the results and finally a short analysis of the results.

6.1 Fish Eye Camera

The goal of this experiment was to show whether a network trained on non-fish eye images can be used on fish eye images with satisfactory results. To the best of the author's knowledge this has not been tested before, and the results are used in the choice of camera for the remainder of the project.

Experimental Setup

A fish eye camera (with a field of view close to 180°) and a normal angled camera were mounted close to each other (the fish eye camera's lens about 3 cm above the normal camera lens) facing the same way, as shown in Figure 6.1.

Figure 6.1: Setup of the cameras used in the fish eye experiment. The circle with an F represents the lens of the fish eye camera and the circle with an N represents the lens of the normal camera.

Image sequences were recorded simultaneously with the two cameras while walking around an office room with various objects in it. A third image sequence was created with rectified versions of the fish eye images. 25 images, equally spaced in time, were extracted from each image sequence and input to R-FCN. Since the goal of this experiment was to compare performances on the three datasets, rather than to determine how "well" the network performs on a global scale, this relatively small number of images was sufficient. Further, as mentioned in Chapter 5, each image generally contains more than one object, so the number of objects in the datasets is larger than the number of images. The three datasets were also manually annotated with bounding boxes for the evaluation. Then, the performances on the three datasets were evaluated, comparing the annotated ground truths with the detection results from R-FCN for all datasets.

Figure 6.2: Examples of the images used in the fish eye experiment: (a) the normal camera, (b) the fish eye camera, (c) a rectified image from the fish eye camera.

Example images from the three datasets can be seen in Figure 6.2. It can be seen that the image quality of the two cameras is not exactly the same. That is, comparing Figures 6.2a and 6.2b there are some differences other than the field of view. For example, Figure 6.2a is darker than Figure 6.2b and this fact may affect the detection performance slightly. However, the training data [15] also comes from different cameras of varied quality, and the differences in image quality should not affect the results too much.
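The rectified dataset mentioned above can be produced by undistorting the fish eye frames. A minimal OpenCV sketch of such a rectification is given below; the intrinsic matrix K and distortion coefficients D are hypothetical placeholders and would in practice come from calibrating the fish eye camera, not values used in this project.

```python
import cv2
import numpy as np

# Hypothetical calibration parameters for illustration only; real values
# would come from a fish eye calibration (e.g. cv2.fisheye.calibrate).
K = np.array([[350.0, 0.0, 640.0],
              [0.0, 350.0, 360.0],
              [0.0, 0.0, 1.0]])
D = np.array([0.05, -0.01, 0.0, 0.0])

def rectify_fisheye(image, K=K, D=D):
    """Undistort (rectify) a fish eye image using OpenCV's fisheye model."""
    h, w = image.shape[:2]
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
    return cv2.remap(image, map1, map2, interpolation=cv2.INTER_LINEAR)
```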

Results

Table 6.1 summarizes the results for all three datasets in the experiment. For the normal angled camera, the fish eye camera and the rectified fish eye images, the table shows the total number of ground truth instances and correct detections of all present classes, and the precision, recall and F1 score averaged over all present object classes. The average precision for the fish eye camera was 1.0, which means that there were no false detections in that dataset. Further, the average recall of the fish eye camera was lower than that of the normal camera, which means that a larger fraction of the present objects were not detected. The F1 score, which summarizes precision and recall, was slightly lower for the fish eye camera than for the normal camera, suggesting lower performance.

Table 6.1: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score averaged over all classes for one dataset recorded with a normal camera, one recorded with a fish eye camera and one with rectified images recorded with a fish eye camera.

The lowest performance was that of the rectified fish eye images. Fewer objects were also detected in this dataset compared to the fish eye dataset. The total number of ground truths in the rectified dataset is lower than that of the fish eye dataset, since some parts of the images are lost in the rectification process. The total number of correct detections was highest for the fish eye camera, nearly twice the number of correct detections in the normal camera dataset.

Figure 6.3: Examples of the bounding boxes generated by R-FCN in the fish eye experiment: (a) the normal camera, (b) the fish eye camera, (c) a rectified image from the fish eye camera.

Figure 6.3 shows examples of the bounding boxes found by R-FCN in the three datasets. Tables 6.2, 6.3 and 6.4 show the results of the fish eye experiment for each present class. They show that some object classes are more easily detected than others. For example, no knives were detected in any of the datasets, while many bottles, cups and keyboards were detected. This is probably because the distance and viewing angle were better suited (more similar to those of the training data) for the latter objects. Table 6.3 shows a precision of 1.0 for all classes, which is because no false positives were detected in the dataset.

Table 6.2: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in a dataset recorded with a normal camera (classes: apple, banana, bottle, cell_phone, chair, cup, diningtable, fork, keyboard, knife, mouse, orange, tvmonitor).

Table 6.3: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in a dataset recorded with a fish eye camera (classes: apple, banana, bottle, cell_phone, chair, cup, fork, keyboard, knife, laptop, mouse, orange, tvmonitor).

Table 6.4: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in a dataset recorded with a fish eye camera and then rectified (classes: apple, banana, bottle, cell_phone, chair, cup, fork, keyboard, knife, laptop, mouse, orange, tvmonitor).

Comparing Tables 6.2, 6.3 and 6.4, there are some differences in which object categories are present. For example, only Table 6.2 contains the category "diningtable", but on the other hand it does not contain the category "laptop". There are different reasons for these differences. The diningtable class is present in Table 6.2 because an incorrect detection of a diningtable was made on that dataset. The laptop class is not present because the laptop seen by the fish eye camera could not be seen by the normal camera (see Figures 6.2 and 6.3, where a laptop can be seen on the right hand side of the fish eye and rectified images).

Analysis

The total number of correct detections of the fish eye camera was higher than that of the normal camera, strengthening the hypothesis that overall more objects can be detected using a fish eye camera. Further, the F1 score was a little lower for the fish eye camera than for the normal camera, but not by much. It can thus be said that it is advantageous to use a fish eye camera for object detection with a network trained on normal images, if the goal is to maximize the total number of detected objects.

Of course, the reason for this advantage of the fish eye camera is the wider field of view, and not that it is easier to detect objects in fish eye images. However, since the fish eye camera performed well, it is used in the remainder of the project.

Surprisingly, objects were detected not only in the center of the fish eye images but also at the distorted borders. Figure 6.4 is an example of this. This fact speaks for the advantage of using a fish eye camera to detect many objects - some of the "extra" detected objects compared to the normal camera are actually outside of the normal camera's field of view, and the numbers cannot be only due to, for example, different image quality.

Figure 6.4: An example of detections on the borders of a fish eye image.

6.2 Distance to Objects

The goal of this experiment was to examine from what distance to view "object clusters" in order to detect as many objects as possible. More objects are expected to be detected when the camera is closer to the objects as compared to when it is further away. The experiment shows whether this is true or not.

Experimental Setup

A similar office environment as in the previous experiment (Section 6.1) was viewed by the fish eye camera (since it performed best in detecting as many objects as possible). More specifically, a table with some objects on it was viewed from different horizontal and vertical distances. The distances were measured from the front edge of the table. The camera was facing forward. An image sequence was recorded from each distance and height from the table edge and 20 images, equally spaced in time, were extracted and run through R-FCN. As in Section 6.1, this relatively small number of images is sufficient since the goal of the experiment is to compare datasets of equal sizes rather than determining how good the performance of the network is on a more global scale. Bounding boxes for objects in the images were also manually annotated and the results compared as explained in Section 5.2.

In order to determine from what distance most objects can be detected, two different horizontal distances and two different vertical distances were examined. First, a horizontal distance of 0 cm between the camera and the table edge was used as a "close" distance. Then, as a "far away" distance, 50 cm was used. Note that a typical ground robot would often have difficulties getting even this close to objects. Further, the closest vertical distance was chosen to be 15 cm (not 0 cm because it would not be possible to fly a drone that close to the table, and a camera is typically not mounted on the lower parts of a drone). Then, the "far away" vertical distance was chosen to be 35 cm, from where many objects were still present in the image. That is, if the camera was moved even higher, there were few objects in the image because of the camera facing forward. Figure 6.5 illustrates the different camera positions with respect to the table and Figure 6.6 shows examples of images from each dataset.

Figure 6.5: Illustration of the distances in the experiment. The table is seen from the side. The dots represent the different camera positions.

Figure 6.6: Example images from different distances: (a) 0 cm horizontal distance, 15 cm vertical distance; (b) 0 cm horizontal distance, 35 cm vertical distance; (c) 50 cm horizontal distance, 15 cm vertical distance; (d) 50 cm horizontal distance, 35 cm vertical distance.

Results

Table 6.5 shows the results of the distance experiment. While more ground truths are present in the two datasets recorded from a 50 cm horizontal distance, the number of correct detections is larger in the 0 cm horizontal distance datasets. This means that the average recalls and the F1 scores in these datasets are higher.

Table 6.5: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score averaged over all classes for datasets recorded with different horizontal and vertical distances to a table.

Figure 6.7 shows examples of the bounding boxes found by R-FCN for the different distance datasets.

Figure 6.7: Example images showing the resulting bounding boxes for different distances: (a) 0 cm horizontal distance, 15 cm vertical distance; (b) 0 cm horizontal distance, 35 cm vertical distance; (c) 50 cm horizontal distance, 15 cm vertical distance; (d) 50 cm horizontal distance, 35 cm vertical distance.

Tables 6.6, 6.7, 6.8 and 6.9 show the results for each present object class in the datasets. Of the two 0 cm horizontal distance datasets, the 15 cm vertical distance dataset shows a larger recall of keyboards and mice than the 35 cm vertical distance dataset, while it is the other way round for bottles and cups. This indicates that each type of object has an "optimal" viewing distance which needs to be adjusted in order to detect that type of object. Smaller objects need to be viewed from a closer distance.

Table 6.6: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset from 0 cm away from and 15 cm above the table (classes: apple, banana, bottle, cup, keyboard, mouse, scissors, tvmonitor).

Table 6.7: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset from 0 cm away from and 35 cm above the table (classes: apple, banana, bottle, cup, keyboard, mouse, scissors, tvmonitor).

Table 6.8: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset from 50 cm away from and 15 cm above the table (classes: apple, bottle, chair, cup, keyboard, laptop, mouse, scissors, tvmonitor).

Table 6.9: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset from 50 cm away from and 35 cm above the table (classes: apple, banana, bottle, chair, cup, keyboard, laptop, mouse, scissors, tvmonitor).

Analysis

The experiment showed, as expected, that small objects can be more easily detected from a closer horizontal distance.

The experiment didn't show as clear results for the vertical distance, which could be because the change in vertical distance was not as large as the one in horizontal distance (20 cm compared to 50 cm). However, it is possible to conclude that being close to small objects increases the chance of detecting them, while larger objects may need a larger distance for best performance.

6.3 Camera Angle

In this experiment three different camera angles are tested. The goal is to determine how to mount the camera on the drone for best object detection performance. One of the advantages of using a drone for detecting objects in a room is that it can fly over large objects, such as tables, in order to get a different kind of view than, for example, a ground robot can. Therefore, in this experiment the camera is moved above and along a table.

Experimental Setup

It is of interest to be as close to the objects as possible; however, a drone cannot fly too close to things (and again, a camera is generally not mounted on the lower parts of a drone). In an office environment, the fish eye camera was moved from one side to the other about 0.4 meters above a table with objects on it. The distance of 0.4 meters was chosen trying to keep the camera as close as possible to the table, because of the results of the distance experiment in Section 6.2. However, because in this experiment the camera was moved along the table, and since there were objects on the table, it was not possible to keep a closer distance.

The camera was moved along the table three times, first with a 0 degree angle of the camera, then with a 45 degree angle and lastly with a 90 degree angle. What is meant by the different angles is demonstrated in Figure 6.8. Each time, an image sequence was recorded and 20 images were extracted and run through R-FCN. Examples from the three datasets are shown in Figure 6.9. The images were also manually annotated with bounding boxes and the results compared.

Figure 6.8: Illustration of the different camera angles: (a) 0 degree camera, (b) 45 degree camera, (c) 90 degree camera. The green arrows show the directions in which the cameras were moved.

Figure 6.9: Example images from the different camera angle datasets: (a) the 0 degree dataset, (b) the 45 degree dataset, (c) the 90 degree dataset.

Results

Table 6.10 shows the results of the angle experiment. The F1 score is a lot higher for the 90 degree dataset, which is expected as most training data images were probably taken from a close to 90 degree perspective.

Table 6.10: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score averaged over all classes in datasets from different angles.

Figure 6.10 shows the bounding boxes predicted by R-FCN in example images from the three datasets in the experiment.

Figure 6.10: Example images from the different camera angle datasets with bounding boxes from R-FCN: (a) the 0 degree dataset, (b) the 45 degree dataset, (c) the 90 degree dataset.

Tables 6.11, 6.12 and 6.13 show the results for each object category present in the different datasets. It can be seen that the higher F1 score of the 90 degree dataset compared to the other two is mostly due to a higher recall of large objects, such as TV-monitors and chairs.

Table 6.11: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset from a 0 degree camera angle (classes: banana, bottle, cell_phone, chair, cup, keyboard, laptop, mouse, tvmonitor).

Table 6.12: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset from a 45 degree camera angle (classes: banana, bottle, cell_phone, chair, cup, keyboard, mouse, tvmonitor).

Table 6.13: Number of ground truth instances, number of correctly detected instances, number of incorrectly detected instances, precision, recall and F1 score for each class in the dataset from a 90 degree camera angle (classes: banana, bottle, chair, cup, keyboard, mouse, tvmonitor).

Similarly to Section 6.1, not all object categories are present in all of Tables 6.11, 6.12 and 6.13. The reasons are the same, either that misclassifications were made or that object instances were outside of the field of view in some of the datasets.

Analysis

The performance on the 0 degree dataset is very low compared to the other two, which is logical since the objects look very different from this point of view compared to the training data.

This fact was mentioned in Chapter 3, as others ([1], [2]) working on drones had already stressed the difficulties in detecting objects in images that are different from the training data. For an example of how different objects may look from above, see Figure 6.11 and note how the cup looks almost completely round, as compared to what a cup looks like from the side.

Figure 6.11: An example of an image from the 0 degree dataset.

The F1 score of the forward facing (90 degree) camera dataset is much larger than that of the other two datasets. This suggests that the camera should be mounted facing forward. However, as can be seen in Table 6.13, one of the reasons for the superior performance of the 90 degree dataset is that more large objects (chairs) were detected. Many of these chairs were in the background of the images (Figure 6.12 is an example of this) and thus distorted in the other datasets. Therefore, depending on the environment where the drone will move, how it is meant to fly and what objects it is meant to detect, it may be reasonable to mount the camera slightly tilted.

Figure 6.12: Example images showing detection of chairs in the background: (a) 45 degree camera angle, (b) 90 degree camera angle.

Chapter 7 Real Drone

This chapter describes a small experiment that was made without any formal evaluation, in order to show whether or not R-FCN could be used on image data from a real, flying drone. A Parrot Bebop 2 drone [22], with a 180° field of view camera (as suggested in Section 6.1) mounted facing forward (as suggested in Section 6.3), was manually navigated in an office room. Attempts were made to fly as close as possible to a table with objects on it, but no distances as close as in the experiments of Chapter 6 could be reached, due to difficulties in controlling the drone. A stream of images was recorded and the images were input to R-FCN. Figure 7.1 shows an extract of images with bounding boxes from the above-mentioned dataset.

Figure 7.1: Example images from a dataset recorded from a real flying drone, with bounding boxes predicted by R-FCN.
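As an illustration of the off-board setup assumed in Section 1.2 (the drone streams images to a computer that runs the detector), a minimal OpenCV sketch of such a processing loop is given below. The stream address and the detect_fn function are hypothetical placeholders, not the actual Bebop 2 interface or the R-FCN code used in the project.

```python
import cv2

STREAM_URL = "rtsp://192.168.42.1/live"  # placeholder address, not the real Bebop 2 endpoint

def run_detector_on_stream(detect_fn, stream_url=STREAM_URL):
    """Off-board loop: read frames from the drone's video stream and run an
    object detector on each frame. detect_fn is a placeholder returning a
    list of (class_name, score, (x1, y1, x2, y2)) tuples for one image."""
    cap = cv2.VideoCapture(stream_url)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        for class_name, score, (x1, y1, x2, y2) in detect_fn(frame):
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, f"{class_name}: {score:.2f}", (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
        cv2.imshow("detections", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```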


More information

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren Kaiming He Ross Girshick Jian Sun Present by: Yixin Yang Mingdong Wang 1 Object Detection 2 1 Applications Basic

More information

Final Report: Smart Trash Net: Waste Localization and Classification

Final Report: Smart Trash Net: Waste Localization and Classification Final Report: Smart Trash Net: Waste Localization and Classification Oluwasanya Awe oawe@stanford.edu Robel Mengistu robel@stanford.edu December 15, 2017 Vikram Sreedhar vsreed@stanford.edu Abstract Given

More information

CS4758: Moving Person Avoider

CS4758: Moving Person Avoider CS4758: Moving Person Avoider Yi Heng Lee, Sze Kiat Sim Abstract We attempt to have a quadrotor autonomously avoid people while moving through an indoor environment. Our algorithm for detecting people

More information

Fuzzy Set Theory in Computer Vision: Example 3

Fuzzy Set Theory in Computer Vision: Example 3 Fuzzy Set Theory in Computer Vision: Example 3 Derek T. Anderson and James M. Keller FUZZ-IEEE, July 2017 Overview Purpose of these slides are to make you aware of a few of the different CNN architectures

More information

Efficient Segmentation-Aided Text Detection For Intelligent Robots

Efficient Segmentation-Aided Text Detection For Intelligent Robots Efficient Segmentation-Aided Text Detection For Intelligent Robots Junting Zhang, Yuewei Na, Siyang Li, C.-C. Jay Kuo University of Southern California Outline Problem Definition and Motivation Related

More information

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs Zhipeng Yan, Moyuan Huang, Hao Jiang 5/1/2017 1 Outline Background semantic segmentation Objective,

More information

R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS

R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS JIFENG DAI YI LI KAIMING HE JIAN SUN MICROSOFT RESEARCH TSINGHUA UNIVERSITY MICROSOFT RESEARCH MICROSOFT RESEARCH SPEED/ACCURACY TRADE-OFFS

More information

Category-level localization

Category-level localization Category-level localization Cordelia Schmid Recognition Classification Object present/absent in an image Often presence of a significant amount of background clutter Localization / Detection Localize object

More information

Apparel Classifier and Recommender using Deep Learning

Apparel Classifier and Recommender using Deep Learning Apparel Classifier and Recommender using Deep Learning Live Demo at: http://saurabhg.me/projects/tag-that-apparel Saurabh Gupta sag043@ucsd.edu Siddhartha Agarwal siagarwa@ucsd.edu Apoorve Dave a1dave@ucsd.edu

More information

Instance-aware Semantic Segmentation via Multi-task Network Cascades

Instance-aware Semantic Segmentation via Multi-task Network Cascades Instance-aware Semantic Segmentation via Multi-task Network Cascades Jifeng Dai, Kaiming He, Jian Sun Microsoft research 2016 Yotam Gil Amit Nativ Agenda Introduction Highlights Implementation Further

More information

Classification and Detection in Images. D.A. Forsyth

Classification and Detection in Images. D.A. Forsyth Classification and Detection in Images D.A. Forsyth Classifying Images Motivating problems detecting explicit images classifying materials classifying scenes Strategy build appropriate image features train

More information

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm Instructions This is an individual assignment. Individual means each student must hand in their

More information

CAP 6412 Advanced Computer Vision

CAP 6412 Advanced Computer Vision CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong April 21st, 2016 Today Administrivia Free parameters in an approach, model, or algorithm? Egocentric videos by Aisha

More information

Know your data - many types of networks

Know your data - many types of networks Architectures Know your data - many types of networks Fixed length representation Variable length representation Online video sequences, or samples of different sizes Images Specific architectures for

More information

Using Machine Learning for Classification of Cancer Cells

Using Machine Learning for Classification of Cancer Cells Using Machine Learning for Classification of Cancer Cells Camille Biscarrat University of California, Berkeley I Introduction Cell screening is a commonly used technique in the development of new drugs.

More information

Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task

Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task Kyunghee Kim Stanford University 353 Serra Mall Stanford, CA 94305 kyunghee.kim@stanford.edu Abstract We use a

More information

Exploiting scene constraints to improve object detection algorithms for industrial applications

Exploiting scene constraints to improve object detection algorithms for industrial applications Exploiting scene constraints to improve object detection algorithms for industrial applications PhD Public Defense Steven Puttemans Promotor: Toon Goedemé 2 A general introduction Object detection? Help

More information

Inception Network Overview. David White CS793

Inception Network Overview. David White CS793 Inception Network Overview David White CS793 So, Leonardo DiCaprio dreams about dreaming... https://m.media-amazon.com/images/m/mv5bmjaxmzy3njcxnf5bml5banbnxkftztcwnti5otm0mw@@._v1_sy1000_cr0,0,675,1 000_AL_.jpg

More information

Large-Scale Traffic Sign Recognition based on Local Features and Color Segmentation

Large-Scale Traffic Sign Recognition based on Local Features and Color Segmentation Large-Scale Traffic Sign Recognition based on Local Features and Color Segmentation M. Blauth, E. Kraft, F. Hirschenberger, M. Böhm Fraunhofer Institute for Industrial Mathematics, Fraunhofer-Platz 1,

More information

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS Kuan-Chuan Peng and Tsuhan Chen School of Electrical and Computer Engineering, Cornell University, Ithaca, NY

More information

Recap Image Classification with Bags of Local Features

Recap Image Classification with Bags of Local Features Recap Image Classification with Bags of Local Features Bag of Feature models were the state of the art for image classification for a decade BoF may still be the state of the art for instance retrieval

More information

Single-Shot Refinement Neural Network for Object Detection -Supplementary Material-

Single-Shot Refinement Neural Network for Object Detection -Supplementary Material- Single-Shot Refinement Neural Network for Object Detection -Supplementary Material- Shifeng Zhang 1,2, Longyin Wen 3, Xiao Bian 3, Zhen Lei 1,2, Stan Z. Li 4,1,2 1 CBSR & NLPR, Institute of Automation,

More information

Presented at the FIG Congress 2018, May 6-11, 2018 in Istanbul, Turkey

Presented at the FIG Congress 2018, May 6-11, 2018 in Istanbul, Turkey Presented at the FIG Congress 2018, May 6-11, 2018 in Istanbul, Turkey Evangelos MALTEZOS, Charalabos IOANNIDIS, Anastasios DOULAMIS and Nikolaos DOULAMIS Laboratory of Photogrammetry, School of Rural

More information

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang SSD: Single Shot MultiBox Detector Author: Wei Liu et al. Presenter: Siyu Jiang Outline 1. Motivations 2. Contributions 3. Methodology 4. Experiments 5. Conclusions 6. Extensions Motivation Motivation

More information

Mask R-CNN. Kaiming He, Georgia, Gkioxari, Piotr Dollar, Ross Girshick Presenters: Xiaokang Wang, Mengyao Shi Feb. 13, 2018

Mask R-CNN. Kaiming He, Georgia, Gkioxari, Piotr Dollar, Ross Girshick Presenters: Xiaokang Wang, Mengyao Shi Feb. 13, 2018 Mask R-CNN Kaiming He, Georgia, Gkioxari, Piotr Dollar, Ross Girshick Presenters: Xiaokang Wang, Mengyao Shi Feb. 13, 2018 1 Common computer vision tasks Image Classification: one label is generated for

More information

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing 3 Object Detection BVM 2018 Tutorial: Advanced Deep Learning Methods Paul F. Jaeger, of Medical Image Computing What is object detection? classification segmentation obj. detection (1 label per pixel)

More information

AUTONOMOUS IMAGE EXTRACTION AND SEGMENTATION OF IMAGE USING UAV S

AUTONOMOUS IMAGE EXTRACTION AND SEGMENTATION OF IMAGE USING UAV S AUTONOMOUS IMAGE EXTRACTION AND SEGMENTATION OF IMAGE USING UAV S Radha Krishna Rambola, Associate Professor, NMIMS University, India Akash Agrawal, Student at NMIMS University, India ABSTRACT Due to the

More information

Object Detection. TA : Young-geun Kim. Biostatistics Lab., Seoul National University. March-June, 2018

Object Detection. TA : Young-geun Kim. Biostatistics Lab., Seoul National University. March-June, 2018 Object Detection TA : Young-geun Kim Biostatistics Lab., Seoul National University March-June, 2018 Seoul National University Deep Learning March-June, 2018 1 / 57 Index 1 Introduction 2 R-CNN 3 YOLO 4

More information

ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall Midterm Examination

ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall Midterm Examination ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall 2008 October 29, 2008 Notes: Midterm Examination This is a closed book and closed notes examination. Please be precise and to the point.

More information

Rich feature hierarchies for accurate object detection and semant

Rich feature hierarchies for accurate object detection and semant Rich feature hierarchies for accurate object detection and semantic segmentation Speaker: Yucong Shen 4/5/2018 Develop of Object Detection 1 DPM (Deformable parts models) 2 R-CNN 3 Fast R-CNN 4 Faster

More information

Computer Vision Lecture 16

Computer Vision Lecture 16 Announcements Computer Vision Lecture 16 Deep Learning Applications 11.01.2017 Seminar registration period starts on Friday We will offer a lab course in the summer semester Deep Robot Learning Topic:

More information

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech Convolutional Neural Networks Computer Vision Jia-Bin Huang, Virginia Tech Today s class Overview Convolutional Neural Network (CNN) Training CNN Understanding and Visualizing CNN Image Categorization:

More information

Storyline Reconstruction for Unordered Images

Storyline Reconstruction for Unordered Images Introduction: Storyline Reconstruction for Unordered Images Final Paper Sameedha Bairagi, Arpit Khandelwal, Venkatesh Raizaday Storyline reconstruction is a relatively new topic and has not been researched

More information

Computer Vision Lecture 16

Computer Vision Lecture 16 Computer Vision Lecture 16 Deep Learning Applications 11.01.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Seminar registration period starts

More information

Machine Learning 13. week

Machine Learning 13. week Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of

More information

Project 3 Q&A. Jonathan Krause

Project 3 Q&A. Jonathan Krause Project 3 Q&A Jonathan Krause 1 Outline R-CNN Review Error metrics Code Overview Project 3 Report Project 3 Presentations 2 Outline R-CNN Review Error metrics Code Overview Project 3 Report Project 3 Presentations

More information

Dynamic Routing Between Capsules

Dynamic Routing Between Capsules Report Explainable Machine Learning Dynamic Routing Between Capsules Author: Michael Dorkenwald Supervisor: Dr. Ullrich Köthe 28. Juni 2018 Inhaltsverzeichnis 1 Introduction 2 2 Motivation 2 3 CapusleNet

More information

Bus Detection and recognition for visually impaired people

Bus Detection and recognition for visually impaired people Bus Detection and recognition for visually impaired people Hangrong Pan, Chucai Yi, and Yingli Tian The City College of New York The Graduate Center The City University of New York MAP4VIP Outline Motivation

More information

Lecture 5: Object Detection

Lecture 5: Object Detection Object Detection CSED703R: Deep Learning for Visual Recognition (2017F) Lecture 5: Object Detection Bohyung Han Computer Vision Lab. bhhan@postech.ac.kr 2 Traditional Object Detection Algorithms Region-based

More information

Multi-View 3D Object Detection Network for Autonomous Driving

Multi-View 3D Object Detection Network for Autonomous Driving Multi-View 3D Object Detection Network for Autonomous Driving Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, Tian Xia CVPR 2017 (Spotlight) Presented By: Jason Ku Overview Motivation Dataset Network Architecture

More information

Pedestrian Detection Using Structured SVM

Pedestrian Detection Using Structured SVM Pedestrian Detection Using Structured SVM Wonhui Kim Stanford University Department of Electrical Engineering wonhui@stanford.edu Seungmin Lee Stanford University Department of Electrical Engineering smlee729@stanford.edu.

More information

Object Recognition II

Object Recognition II Object Recognition II Linda Shapiro EE/CSE 576 with CNN slides from Ross Girshick 1 Outline Object detection the task, evaluation, datasets Convolutional Neural Networks (CNNs) overview and history Region-based

More information

Deep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia

Deep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia Deep learning for dense per-pixel prediction Chunhua Shen The University of Adelaide, Australia Image understanding Classification error Convolution Neural Networks 0.3 0.2 0.1 Image Classification [Krizhevsky

More information

Rich feature hierarchies for accurate object detection and semantic segmentation

Rich feature hierarchies for accurate object detection and semantic segmentation Rich feature hierarchies for accurate object detection and semantic segmentation Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Presented by Pandian Raju and Jialin Wu Last class SGD for Document

More information

Advanced Video Analysis & Imaging

Advanced Video Analysis & Imaging Advanced Video Analysis & Imaging (5LSH0), Module 09B Machine Learning with Convolutional Neural Networks (CNNs) - Workout Farhad G. Zanjani, Clint Sebastian, Egor Bondarev, Peter H.N. de With ( p.h.n.de.with@tue.nl

More information

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao Classifying Images with Visual/Textual Cues By Steven Kappes and Yan Cao Motivation Image search Building large sets of classified images Robotics Background Object recognition is unsolved Deformable shaped

More information

Skin and Face Detection

Skin and Face Detection Skin and Face Detection Linda Shapiro EE/CSE 576 1 What s Coming 1. Review of Bakic flesh detector 2. Fleck and Forsyth flesh detector 3. Details of Rowley face detector 4. Review of the basic AdaBoost

More information

ECE 5470 Classification, Machine Learning, and Neural Network Review

ECE 5470 Classification, Machine Learning, and Neural Network Review ECE 5470 Classification, Machine Learning, and Neural Network Review Due December 1. Solution set Instructions: These questions are to be answered on this document which should be submitted to blackboard

More information

CS 523: Multimedia Systems

CS 523: Multimedia Systems CS 523: Multimedia Systems Angus Forbes creativecoding.evl.uic.edu/courses/cs523 Today - Convolutional Neural Networks - Work on Project 1 http://playground.tensorflow.org/ Convolutional Neural Networks

More information

Lab 9. Julia Janicki. Introduction

Lab 9. Julia Janicki. Introduction Lab 9 Julia Janicki Introduction My goal for this project is to map a general land cover in the area of Alexandria in Egypt using supervised classification, specifically the Maximum Likelihood and Support

More information

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides Deep Learning in Visual Recognition Thanks Da Zhang for the slides Deep Learning is Everywhere 2 Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object

More information

FINE-GRAINED image classification aims to recognize. Fast Fine-grained Image Classification via Weakly Supervised Discriminative Localization

FINE-GRAINED image classification aims to recognize. Fast Fine-grained Image Classification via Weakly Supervised Discriminative Localization 1 Fast Fine-grained Image Classification via Weakly Supervised Discriminative Localization Xiangteng He, Yuxin Peng and Junjie Zhao arxiv:1710.01168v1 [cs.cv] 30 Sep 2017 Abstract Fine-grained image classification

More information

Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon

Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon Deep Learning For Video Classification Presented by Natalie Carlebach & Gil Sharon Overview Of Presentation Motivation Challenges of video classification Common datasets 4 different methods presented in

More information

Toward Scale-Invariance and Position-Sensitive Region Proposal Networks

Toward Scale-Invariance and Position-Sensitive Region Proposal Networks Toward Scale-Invariance and Position-Sensitive Region Proposal Networks Hsueh-Fu Lu [0000 0003 1885 3805], Xiaofei Du [0000 0002 0071 8093], and Ping-Lin Chang [0000 0002 3341 8425] Umbo Computer Vision

More information

Exploiting Depth from Single Monocular Images for Object Detection and Semantic Segmentation

Exploiting Depth from Single Monocular Images for Object Detection and Semantic Segmentation APPEARING IN IEEE TRANSACTIONS ON IMAGE PROCESSING, OCTOBER 2016 1 Exploiting Depth from Single Monocular Images for Object Detection and Semantic Segmentation Yuanzhouhan Cao, Chunhua Shen, Heng Tao Shen

More information

Hide-and-Seek: Forcing a network to be Meticulous for Weakly-supervised Object and Action Localization

Hide-and-Seek: Forcing a network to be Meticulous for Weakly-supervised Object and Action Localization Hide-and-Seek: Forcing a network to be Meticulous for Weakly-supervised Object and Action Localization Krishna Kumar Singh and Yong Jae Lee University of California, Davis ---- Paper Presentation Yixian

More information

3D Object Recognition and Scene Understanding from RGB-D Videos. Yu Xiang Postdoctoral Researcher University of Washington

3D Object Recognition and Scene Understanding from RGB-D Videos. Yu Xiang Postdoctoral Researcher University of Washington 3D Object Recognition and Scene Understanding from RGB-D Videos Yu Xiang Postdoctoral Researcher University of Washington 1 2 Act in the 3D World Sensing & Understanding Acting Intelligent System 3D World

More information

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213) Recognition of Animal Skin Texture Attributes in the Wild Amey Dharwadker (aap2174) Kai Zhang (kz2213) Motivation Patterns and textures are have an important role in object description and understanding

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Object Detection Design challenges

Object Detection Design challenges Object Detection Design challenges How to efficiently search for likely objects Even simple models require searching hundreds of thousands of positions and scales Feature design and scoring How should

More information

Industrial Technology Research Institute, Hsinchu, Taiwan, R.O.C ǂ

Industrial Technology Research Institute, Hsinchu, Taiwan, R.O.C ǂ Stop Line Detection and Distance Measurement for Road Intersection based on Deep Learning Neural Network Guan-Ting Lin 1, Patrisia Sherryl Santoso *1, Che-Tsung Lin *ǂ, Chia-Chi Tsai and Jiun-In Guo National

More information

Human Detection. A state-of-the-art survey. Mohammad Dorgham. University of Hamburg

Human Detection. A state-of-the-art survey. Mohammad Dorgham. University of Hamburg Human Detection A state-of-the-art survey Mohammad Dorgham University of Hamburg Presentation outline Motivation Applications Overview of approaches (categorized) Approaches details References Motivation

More information

Optimizing Object Detection:

Optimizing Object Detection: Lecture 10: Optimizing Object Detection: A Case Study of R-CNN, Fast R-CNN, and Faster R-CNN Visual Computing Systems Today s task: object detection Image classification: what is the object in this image?

More information

https://en.wikipedia.org/wiki/the_dress Recap: Viola-Jones sliding window detector Fast detection through two mechanisms Quickly eliminate unlikely windows Use features that are fast to compute Viola

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Large-scale Video Classification with Convolutional Neural Networks

Large-scale Video Classification with Convolutional Neural Networks Large-scale Video Classification with Convolutional Neural Networks Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, Li Fei-Fei Note: Slide content mostly from : Bay Area

More information

A Deep Learning Approach to Vehicle Speed Estimation

A Deep Learning Approach to Vehicle Speed Estimation A Deep Learning Approach to Vehicle Speed Estimation Benjamin Penchas bpenchas@stanford.edu Tobin Bell tbell@stanford.edu Marco Monteiro marcorm@stanford.edu ABSTRACT Given car dashboard video footage,

More information

Supplementary Materials for DVQA: Understanding Data Visualizations via Question Answering

Supplementary Materials for DVQA: Understanding Data Visualizations via Question Answering Supplementary Materials for DVQA: Understanding Data Visualizations via Question Answering Kushal Kafle 1, Brian Price 2 Scott Cohen 2 Christopher Kanan 1 1 Rochester Institute of Technology 2 Adobe Research

More information

2 OVERVIEW OF RELATED WORK

2 OVERVIEW OF RELATED WORK Utsushi SAKAI Jun OGATA This paper presents a pedestrian detection system based on the fusion of sensors for LIDAR and convolutional neural network based image classification. By using LIDAR our method

More information

arxiv: v1 [cs.lg] 31 Oct 2018

arxiv: v1 [cs.lg] 31 Oct 2018 UNDERSTANDING DEEP NEURAL NETWORKS USING TOPOLOGICAL DATA ANALYSIS DANIEL GOLDFARB arxiv:1811.00852v1 [cs.lg] 31 Oct 2018 Abstract. Deep neural networks (DNN) are black box algorithms. They are trained

More information

A New Protocol of CSI For The Royal Canadian Mounted Police

A New Protocol of CSI For The Royal Canadian Mounted Police A New Protocol of CSI For The Royal Canadian Mounted Police I. Introduction The Royal Canadian Mounted Police started using Unmanned Aerial Vehicles to help them with their work on collision and crime

More information