Automatic Indoor 3D Scene Classification using RGB-D data


MSc Artificial Intelligence
Track: Computer Vision
Master Thesis
Automatic Indoor 3D Scene Classification using RGB-D data
by Nick de Wolf
September EC
Supervisors: Prof. Dr. T. Gevers, S. Karaoglu
Assessor: Dr. P.H. Rodenburg

Abstract Being able to understand natural scenes is crucial to many activities for humans and animals. A large portion of scene understanding is derived from vision, and as a result a great amount of research has been put into this topic in the field of computer vision. Most of the research has been oriented towards applying 2D based approaches, while 3D approaches have only seen an increase in popularity in recent years. Depth sensors have become more precise and affordable, which paved the way for gathering 3D data on a larger scale. In this work, we investigate the potential performance gains that this new depth data can bring to existing 2D methods in two different tasks within scene classification, namely object proposal generation and scene category classification. First, we extend a popular 2D approach for object proposal generation with two novel depth-based features that use the information gathered from the 3D point cloud. The results show that the combination of the default algorithm with these additional features can achieve similar recall and MABO scores, while generating significantly fewer object proposals. Second, we created a global depth-based feature, which uses the detected objects in a scene, for the task of scene category classification. The spatial relations of the objects are used to generate a context-based feature. This feature is used in combination with deep learning approaches on both color and depth information to train an SVM classifier for 21 different scene classes. The results show that using this feature in combination with the deep learning approaches yields an increase in mAP scores in the classification task.

Acknowledgements I would like to thank my supervisor Theo Gevers for providing me with the opportunity to work on this interesting topic and for providing guidance during our meetings. In addition, I would like to thank my colleagues at 3DUniversum; their knowledge of the fields related to my thesis and their willingness to help have certainly improved the quality of this work. In particular, I would like to thank my daily supervisor Sezer Karaoglu for his guidance, ideas, and feedback during the process of writing this thesis. I am also grateful to Morris Franken for his assistance with setting up the server, and for his extensive knowledge of the tools used in this thesis.

Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Goal
  1.3 Thesis Outline
2 Background
  2.1 Object proposals
  2.2 Image Oversegmentation
    2.2.1 RGB-D Oversegmentation
  2.3 Selective Search
  2.4 Scene classification
  2.5 Convolutional Neural Networks
    2.5.1 Convolutional layer
    2.5.2 Pooling layer
    2.5.3 Fully connected layer
    2.5.4 GoogleNet
  2.6 Features using context
3 Methodology
  3.1 Extending Selective Search
  3.2 CNN with additional depth features
    3.2.1 Fine-tuning CNN
4 Experiments
  4.1 Dataset
    4.1.1 Properties
  4.2 Data pre-processing and Analysis
  4.3 Implementation
  4.4 Extended Selective Search
  4.5 Scene Classification
5 Results
  5.1 Extended Selective Search
    Evaluation Metrics
    Reducing segment count
    Varying Colorspaces
    Weighing depth features
    Varying IoU overlap threshold
  5.2 Scene Classification
    Evaluation Metrics
    Combining measures
6 Conclusion
Bibliography

Chapter 1 Introduction 1.1 Motivation The ability to understand a scene as a human is crucial to many natural activities, such as recognition, navigation, and general interaction with the environment. Knowing where you are and in what kind of environment you are can also be useful in tasks where robots have to navigate, as their navigation models could be fine-tuned to all kinds of different room types. Scene understanding has always been an active field of research within computer vision, and significant progress has been achieved over the past decades. Until recently, research has mostly focused on using standard RGB images for scene classification tasks. State-of-the-art methods [1, 2] have used hand-crafted features such as HOG [3], SIFT [4], and SURF [5]. These methods use a variety of techniques, such as computing descriptors by generating keypoints in images. These descriptors can be used to efficiently describe and compare images by matching a generated descriptor of an image against already classified images. In order to create efficient features, a sufficient amount of prior knowledge is required, and the accuracy of the resulting method is largely determined by the creativity of the designer. The current state-of-the-art approaches mostly use Convolutional Neural Networks (CNNs) [6] instead of hand-crafted features in the scene classification task [7–10]. One of the big advantages of CNNs over hand-crafted features is that one does not have to worry about designing features by hand; instead, the CNN essentially learns the features. CNNs have become highly involved in the learning process for computer vision tasks. One of the reasons that CNNs have only become popular recently is that CNNs require a large amount of image data to be trained properly. In recent years, large datasets of image data have

become available [11–13], which in turn allowed for large CNNs to be trained without the costs associated with gathering all this data. Another reason is the improvement in the computational power of systems that are now widely available. Today, most of the computations with a CNN are performed using a GPU or a cluster of GPUs [14]. While most state-of-the-art conventional methods and CNNs that work with RGB images perform well on most scenes [8–10, 15, 16], their performance can deteriorate under different conditions such as varying lighting conditions (e.g. highlights, shadows, and dim light) [17]. Moreover, these methods inherently lack geometric information about the scene because, in essence, a picture is a 2D projection of the 3D world. This loss of information makes scene classification algorithms less robust to varying image conditions. By adding depth as an additional channel to the image, geometric information could be exploited at a pixel level. Geometric information can be useful for deriving contextual information from a scene. If we look at the task of object detection, we can see that previous work has used the relations between objects before [18–21]. Rabinovich et al. [19] demonstrated that the presence of a specific object class inside a scene can influence the probability of an object of another class being present. The work of Torralba et al. [18] showed that certain objects occur more often in certain scenes than others. For instance, it is more likely to find a toilet inside a bathroom than inside an office. By adding additional depth information to each pixel, one could also obtain properties such as the relative sizes of objects or the distance between objects in the real world. Depth measurements are invariant to variables such as lighting conditions, which could lead to more robust features. Another benefit is that, unlike methods that attempt to estimate the 3D structure of a scene with techniques like Structure from Motion [22], this depth data would already be available. Hence, generating the 3D structure would be computationally less expensive and, provided that the depth sensor is accurate, would also be more precise. Being able to use this additional information freely would enable algorithms to extract the scene structure more accurately, which could improve the performance in various tasks within scene classification. In recent years, depth sensors have become available on the consumer market for an affordable price. This enables us to acquire reliable depth maps at a very low cost, stimulating the use of this additional channel of data. Currently, depth sensors are accurate for indoor scenes, where most affordable sensors have an effective range from 50 centimeters up to five to seven meters. By combining a depth sensor with a regular camera, the lack of geometric data in RGB images can be solved, resulting in RGB-D data [23]. In order to make optimal use of the depth data that is generated from the depth sensors, the scope of the scene understanding tasks will be limited to indoor

scenes in this work. With indoor scenes, there is a higher likelihood that the edges of the scene stay within the effective range of the depth sensor than with outdoor scenes. For the approach in this thesis, we will use object-object context information. Hence, it is important to be able to find the objects inside the scene. To detect objects in an image, one first has to find their approximate location. Afterwards, a classifier can determine what kind of object is present at that location, if any. Until recent years, the Sliding Windows paradigm was used in most successful approaches [3, 24–26]. Under this paradigm, an algorithm systematically goes through an image in search of potential objects. The main problem lies in the number of locations that have to be searched for potential candidates. If you generate a large number of boxes, with a large number of different dimensions, you are certain that you will be able to find every possible object location in the image. Unfortunately, applying detection algorithms to a high number of object locations is computationally intractable for most state-of-the-art object detectors [8–10, 15, 16]. In recent years, approaches have been suggested that can provide a trade-off between high detection quality and computational tractability under the name of object proposals [27–32]. Although they all use different approaches, they work under the common assumption that all objects have certain properties that differentiate them from the background, and can thus be localized. These methods are able to maintain a high recall of the objects present in the scene, while using significantly fewer windows when compared to using Sliding Windows. This reduction in generated proposals could in turn improve the object detection results [33]. Most common object proposal generators use just the color values, and do not account for depth information. Yet, this additional depth information can be used to reduce the number of boxes significantly, while losing only little accuracy in the object proposal generation step. The contributions of this thesis are two-fold. The main contribution is the introduction of a novel approach for improving object proposals for object detection, by implementing the additional depth channel into a popular existing method called Selective Search [32]. The second contribution is a potential application of these improved object proposals in combination with RGB-D data. For this task, we assume that the object proposals are used effectively in an object detector, and that the resulting detected objects are used to model the scene context. The detected objects are used to model the object-object spatial relations in a scene, and we show that the context provided by the detected objects can be used in combination with the depth information to improve scene category classification. These spatial relations are represented by a feature that uses the objects present in a scene and the distances between these objects. The performance of this novel

feature is compared to the performance of an RGB CNN [8] and a depth CNN [7] for the task of scene category classification. 1.2 Goal The focus of this thesis is on investigating the potential use of the additional depth channel in both the region proposal generation step and the scene category classification task. The research questions for these tasks are: 1. How can the additional depth channel of RGB-D data be used to improve the results of existing object proposal methods, originally implemented for RGB data? 2. Can the spatial object-object relations be used to generate a feature that can be used with state-of-the-art approaches for scene classification? 3. How do object-object based context features, based on the global context in an image, compare to local depth-based features? 1.3 Thesis Outline The organization of the remainder of this thesis is as follows: in Section 2 the background and prior work related to scene classification will be discussed. Section 3 will present the research approach for both the object proposal approach and the scene classification task. Implementation details and an elaboration on the experiments are given in Section 4. Section 5 presents the results and their corresponding analysis. Finally, our conclusions and possible directions for future work are presented in Section 6.

Chapter 2 Background As discussed in the introduction, this work will focus on object proposal generation and scene classification using RGB-D data. A common task within scene classification is object detection, which will only be discussed briefly. First, related work on object proposal generation is discussed, and then the background involving scene recognition is presented. 2.1 Object proposals In recent years, not just determining what object is being presented in an image has become important, but also determining the location of the object. Most state-of-the-art object detectors are designed to classify a single object at a time per given frame. As most images do not display just a single object, it should be possible to detect multiple objects in an image. In order to solve the problem of multiple objects per image, an often used approach is to divide the image into smaller windows in an attempt to capture all possible object locations. The most naive way of solving this problem is by using the Sliding Windows paradigm [3, 24, 25], in which windows (also called bounding boxes) are generated in a predefined grid, in an attempt to cover every possible window in the image. By using the sliding windows paradigm, the algorithm is constrained to only use computationally efficient classifiers. Because these approaches generally produce a large number of windows, it would quickly become computationally intractable to apply expensive state-of-the-art complex object detectors, because of the significant computation time required per window. Less naive ways of proposing object windows have also been suggested in recent years [27–32, 34]. The goal of the object proposal methods is to reduce the number of generated

windows, while keeping the high quality windows. The quality of the object proposals is determined by the probability of the proposal containing an object and how tightly the proposal fits the object. An example of some object proposals in an image is shown in Figure 2.1. If we look at the figure, we can see that the green boxes, representing the ground truth boxes, have a tight fit around an object in the scene and that they are preferred over the larger blue boxes, which contain many more background pixels. The red boxes do not capture any objects, or only a part of an object, and most object proposal methods attempt to remove most of these types of boxes. Figure 2.1: An example of generated object proposals inside a scene. The blue proposals should be scored lower than the green proposals (ground truth) because they are not as precise, and the red proposals should receive the lowest score, because they either do not cover an object or cover an object only partially. Because most object proposal methods filter out the lower quality bounding boxes, they generally generate a lower number of bounding boxes than the traditional Sliding Window paradigm. This reduction in generated bounding boxes allows one to use a computationally more expensive classifier in combination with these boxes to improve the results of, for instance, object detection pipelines. There is a variety of different approaches for generating object proposals. For instance, previous work has shown that a measure of Objectness can be determined for a window, which attempts to rank candidates on the probability that the candidate contains an object [34]. This score is computed by considering local cues, such as edges, corners, and contours. Recently, Cheng et al. [27] presented their Binarized Normed Gradients (BING) algorithm. BING can efficiently compute the Objectness of a window, by resizing the window to an 8x8 window and taking the gradients to compute a 64D feature. As shown in their paper, windows with objects inside them usually share similarities in the 64D feature vector, which can be used to measure the Objectness. The above approaches attempt to reduce the number of generated bounding boxes by scoring and thresholding them.
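
How tightly a proposal fits an object is usually quantified with the intersection-over-union (IoU) overlap between a proposal box and a ground-truth box; this overlap also underlies the recall and MABO scores used later in this thesis. The snippet below is a minimal illustrative sketch of that measure, not code from the thesis; boxes are assumed to be given as (x1, y1, x2, y2) pixel coordinates.

```python
# Minimal sketch (illustration only): the standard intersection-over-union (IoU)
# measure used to score how tightly a proposal box fits a ground-truth box.
# Boxes are assumed to be (x1, y1, x2, y2) coordinates with x2 > x1 and y2 > y1.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    # Union = sum of the two areas minus the intersection counted once.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A proposal is typically counted as a hit when IoU >= 0.5;
# MABO averages, per ground-truth object, the best IoU over all proposals.
print(iou((10, 10, 60, 60), (30, 30, 80, 80)))  # ~0.16: a loose proposal
```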

In other approaches, such as the work of Endres and Hoiem [28], this number is reduced by applying a different approach to generating the windows. In their paper, regions are generated by placing seeds on various places in the images. For each of these seeds, a separate foreground-background segmentation is performed, which in turn generates the regions that will be considered. The main advantage of this seed approach is the high quality of the generated proposals, although this comes at a high computational cost. The work of Hosang et al. [35] compares a number of state-of-the-art approaches on a variety of criteria. By limiting the scope to the three best performing methods on recall and detection results, we can split these three methods into two categories: Objectness [36] and Superpixel merging [29, 32]. The approach of Zitnick and Dollár [36] starts with the concept of the Sliding Window paradigm, but instead of testing every possible location, it uses better estimates for the windows in combination with some refinement of the results afterwards. Their approach differs from the previously discussed methods that compute an Objectness score [27, 34]. These two approaches use a variety of cues, such as the edges within the window. The algorithm of Zitnick and Dollár [36] only uses the edges within the window and, in contrast to the previous methods, only tries to make use of edges that belong to a contour that is fully contained within the window. The approaches in the superpixel merging category do not use the Sliding Window paradigm; instead, they start by oversegmenting an image into a high number of small segments, often called superpixels. By generating a large number of segments per image, it is more likely that all pixels within each segment share a certain set of properties. Segmentation has several advantages and disadvantages over just generating the bounding boxes. An advantage for which it is used in general is the task of semantic segmentation. In the task of semantic segmentation, one does not only want to create segments in the image, but also label these segments. The labels could later be used in tasks such as object detection. The advantage over using bounding boxes is that a proper segmentation can capture even irregularly shaped objects more precisely than a bounding box. The pixels inside a segment should only include pixels that lie within the object boundaries, while the pixels inside a bounding box may also include a lot of background pixels. A disadvantage of using segmentation is that the process of creating and storing the image segments takes more time and space than computing the four corners of a bounding box. As the name suggests, the superpixel merging approaches attempt to merge segments together, based on some similarity measure. For instance, Arbeláez et al. [29] introduced

their Multiscale Combinatorial Grouping (MCG) method, which consists of both multiscale hierarchical bottom-up segmentation [37] and object candidate generation. The resulting segments are merged based on the edge strengths, and the resulting object proposals are ranked based on several cues such as edge strength, location, shape, and size. While this approach is amongst the top scoring approaches on both recall and detection scores [35], its repeatability and computational time per image are worse than those of the other two approaches. The best scoring object proposal method according to the comparison of Hosang et al. [35] is the Selective Search [32] algorithm. Although their approach is slower than the approach of Zitnick and Dollár [36], it is able to output segments and not just bounding boxes. Selective Search is also able to run significantly faster than MCG and scores better on the repeatability score [35]. Selective Search starts with an initial oversegmentation, and iteratively merges these regions until a single region remains. Afterwards, bounding boxes are generated for each of the regions in every iteration. For this thesis, we decided to use a superpixel merging approach because the generated segments are more versatile than plain bounding boxes. For instance, when using RGB-D data, the depth data can directly be extracted from the pixels inside the segments, instead of first having to distinguish the borders of the objects inside a bounding box. Selective Search performs best out of the superpixel merging approaches presented in the work of Hosang et al. [35]. Hence, this object proposal algorithm will also be used in this thesis. As mentioned in the introduction, we want to use the additional depth channel of RGB-D data to improve the performance of existing object proposal methods for RGB data. In order to extract features from this additional channel effectively, the initial oversegmentation method will also have to use this extra channel. In the following section, we will first discuss the initial oversegmentation method used in Selective Search and some related work, before introducing oversegmentation algorithms that also use the extra depth channel. Afterwards, we will discuss Selective Search in greater detail in Section 2.3. 2.2 Image Oversegmentation As mentioned in the previous section, the Selective Search algorithm starts with an oversegmentation of an image. For this initial oversegmentation, they use the graph-based oversegmentation algorithm of Felzenszwalb and Huttenlocher [38] to generate the initial regions. In this section, we discuss both this method and another often used oversegmentation method, namely the normalized cuts algorithm [39]. Afterwards, oversegmentation methods based on RGB-D input data are discussed in Section 2.2.1.

Both oversegmentation methods are graph-based methods and are known for their speed and performance. The graph-based methods represent an image as an undirected graph G = (V, E), where v ∈ V are the vertices, and in this instance of image segmentation v represents a pixel in the image. The weight of an edge represents the dissimilarity between the connected pixels, and (v_i, v_j) ∈ E represent the pairs of neighbouring vertices/pixels. The normalized cuts algorithm recursively partitions the generated graph using cues like texture and contour. The graph is cut using the eigenvectors with the smallest eigenvalues, and the goal is to minimize their normalized cut criterion. A benefit of this approach is that the number of segments can be directly controlled by this criterion. The graph-based method of Felzenszwalb and Huttenlocher [38] measures the likelihood of a boundary between two regions by measuring two quantities: the intensity differences between connected pixels in each region and the intensity differences across the potential boundary. The intuition behind this approach is that when the difference between the intensity at the boundary and the intensity within the region is large, the boundary is likely to be correct. Unlike the normalized cuts algorithm, the number of superpixels or their size cannot be directly controlled with this algorithm. 2.2.1 RGB-D Oversegmentation The standard image oversegmentation methods are designed to work with RGB images, and in turn do not inherently use the depth channel that RGB-D images provide. For this work, we investigate whether using a segmentation method that uses depth information would give any improvement in the Selective Search algorithm. So far, we only discussed methods that work on RGB images. In particular, we focused on superpixel methods that reduce the number of regions that have to be considered by, for instance, object detection algorithms. Although these methods can certainly be effective, the fact that superpixels have to stay within the object boundaries by using only the 2D projected information from the 3D scene means they do not use all the available information. An example where a color-based segmentation algorithm would have more difficulty with segmenting the image, compared to an algorithm that can use the depth information, can be found in Figure 2.2. In the image, you can see a white table positioned against a white cabinet. For the color-based methods the colors from both surfaces are mostly equal, apart from the shadows, making it harder to segment these surfaces. But if one were to use the depth information of RGB-D data, the two orthogonal planar surfaces would be easily detected and segmented. Image segmentation using RGB-D images has seen plenty of research in recent years [12, 40–44]. By using the RGB-D data, this geometric information can be easily included in the segmentation process, and the 3D geometric relationships between points can be used to prevent superpixels from crossing the object boundaries.

Figure 2.2: A situation where depth information can greatly improve segmentation. The white table is positioned against a white cabinet, which would make it harder for color-based measures to properly segment the two orthogonal surfaces. In this thesis, we focus on segmentation methods that can specifically be used to provide a good oversegmentation. One such method is the Depth Adaptive SuperPixel (DASP) algorithm [43]. DASP is an adaptation of a popular RGB superpixel generation algorithm called Simple Linear Iterative Clustering (SLIC) [45]. SLIC is an iterative gradient ascent algorithm which uses local k-means clustering to cluster pixels in a five-dimensional space, consisting of the two-dimensional location and the color. DASP uses the additional depth channel to increase this dimensional space by adding both depth information and the angles of the normals on the geometric surface of each point. DASP is considered a 2.5D method, because it works in the 2D domain, while using some additional information from the depth channel. One of the downsides of these approaches is that the segmentation is performed on a single view, which makes strongly occluded objects difficult to detect. In this thesis, we will use an approach that uses the 3D point cloud of the scene that can be generated from the RGB-D image, namely the Voxel Cloud Connectivity Segmentation (VCCS) algorithm [44]. VCCS can be used to generate supervoxels inside point clouds. In this work, supervoxels are essentially superpixels in three-dimensional space, in contrast to other papers where voxels imply extensions of 2D methods to 3D by taking video frames and stacking them to generate the additional dimension [46]. Using a point cloud representation of a scene as input, the supervoxels are generated by using k-means clustering with two geometrical constraints. The first constraint forces the seeding of the clusters to spread uniformly through the 3D space, which ensures that supervoxels are spread evenly throughout the geometry of the scene.

The second constraint enforces that all voxels within a supervoxel are connected in 3D space, which ensures that supervoxels cannot merge with voxels they are not connected to, even if they are neighbours in the projected image. Two voxels are considered adjacent in 3D space if they share faces, edges, or vertices. The algorithm starts by dividing the 3D space into a voxelized grid with a resolution R_seed, which effectively translates to the distance between the initial seeds. If seeds are isolated in space from any other seeds, they are removed, because they are most likely the result of noise. Using a small search radius R_search, the voxels surrounding each seed are counted, and if there are not at least as many voxels as one would expect from fitting a plane through the search radius, the seed is removed. The seeds are relocated to their closest voxel, and the supervoxel clustering starts. For each supervoxel, starting at the seed, the closest voxel is searched in 3D space. If that voxel has not been assigned to a supervoxel yet, it is added to its closest seed. Once a single voxel has been added to a seed, the same process starts for the next seed, and this process continues until either all voxels have reached the leaf nodes of their adjacency graph or there are no unlabelled voxels left. Supervoxels are thus generated by iteratively expanding each seed in 3D space. Hence, labels cannot cross over object boundaries. Because all supervoxels expand at the same rate, a similar size for each supervoxel can be expected. The resulting supervoxels can either be used directly or be projected back into the 2D plane, depending on the algorithm that uses them. Figure 2.3 shows an example of the possible outputs of the VCCS algorithm. Figure 2.3: Example of the VCCS algorithm output. Left: original image; Middle: supervoxels with connectivity lines; Right: resulting 2D projection of the supervoxels. The initial oversegmentation is just the first step of the Selective Search algorithm. In the following section, the remainder of the algorithm is discussed in more detail.
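
The snippet below is a greatly simplified sketch of this seed-and-grow idea, added here for illustration only; it is not the actual VCCS implementation, which additionally uses color, normals, and a weighted feature distance during the search. It only shows how seeds placed on a coarse grid of resolution R_seed are grown breadth-first over adjacent occupied voxels, so that labels cannot jump across gaps in the geometry and all seeds expand at the same rate.

```python
# Greatly simplified sketch of seed-and-grow supervoxel clustering on a voxel grid
# (illustration only, not the VCCS implementation). Points are in metres, shape (N, 3).
import numpy as np
from collections import deque

def grow_supervoxels(points, r_voxel=0.02, r_seed=0.2):
    # Occupied voxels of the scene at resolution r_voxel.
    voxels = {tuple(v) for v in np.floor(points / r_voxel).astype(int)}

    # Seeds: one occupied voxel per cell of a coarser grid of resolution r_seed.
    seed_cells = {}
    for v in voxels:
        cell = tuple(int(c * r_voxel // r_seed) for c in v)
        seed_cells.setdefault(cell, v)
    seeds = list(seed_cells.values())

    # 26-connected neighbourhood: voxels sharing a face, edge, or vertex.
    offsets = [(dx, dy, dz) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
               for dz in (-1, 0, 1) if (dx, dy, dz) != (0, 0, 0)]

    labels = {v: i for i, v in enumerate(seeds)}
    frontiers = [deque([v]) for v in seeds]

    # Round-robin breadth-first growth: every seed expands at the same rate, so
    # supervoxels get similar sizes and never cross gaps in the occupied geometry.
    while any(frontiers):
        for label, frontier in enumerate(frontiers):
            if not frontier:
                continue
            v = frontier.popleft()
            for off in offsets:
                n = (v[0] + off[0], v[1] + off[1], v[2] + off[2])
                if n in voxels and n not in labels:
                    labels[n] = label
                    frontier.append(n)
    return labels  # mapping: voxel -> supervoxel id
```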

it has been proven to outperform all other current state-of-the-art methods [35]. In addition, it also allows for the easy addition of new measures, such as those based on depth features. The Selective Search method starts by oversegmenting an image using graph-based segmentation [38], which is known for its speed and relatively good performance. Selective Search is a bottom-up hierarchical grouping method that continues combining the two most similar regions until only a single region remains at the top of the hierarchy. The similarity score is determined by a combination of easily computable measures, as opposed to other methods which use a single computationally expensive measure, as in [29]. Using several different measures has the benefit of diversifying the results, and Selective Search has three areas where it can diversify. First, complementary color spaces can be used, such as RGB, grey-scale, and HSV. This helps with accounting for different light sources, as each of the color spaces captures different types of such conditions. In the original paper, the HSV color space produced the best results; hence this shall be considered the default in this work. Second, the starting locations can be varied by altering the parameters of the graph-based segmentation [38], or by choosing a different initial segmentation method. Last, four easy to compute complementary similarity measures are implemented, and these will be discussed in greater detail. The first measure is s_color(r_i, r_j), which computes normalized one-dimensional color histograms C_i and C_j for segment r_i and segment r_j, where each channel in the color space receives 25 bins. In order to compute the similarity between these two vectors, their intersection is computed. The normalization ensures the similarity value stays in the range of [0, 1]. The s_color measure can efficiently be propagated through the hierarchy by taking the average of the two histograms, weighted by the size of each segment. This can be formalized as:

$$C_t = \frac{\mathrm{size}(r_i)\, C_i + \mathrm{size}(r_j)\, C_j}{\mathrm{size}(r_i) + \mathrm{size}(r_j)}, \qquad (2.1)$$

where C_t represents the color histogram of the newly generated region r_t. The size of r_t is simply the summation of the sizes of r_i and r_j. The second measure is s_texture(r_i, r_j), which attempts to capture the texture similarity between two segments by taking the Gaussian derivatives in eight orientations, with σ = 1, for each color channel. For each orientation and color a histogram is created, and these are combined and normalized to get the final result. Similar to s_color, the similarity score is computed by calculating the intersection between the two histograms. The third measure, s_size(r_i, r_j), tries to prioritize merging smaller regions over large regions, so as to prevent one large region from merging with all the smaller regions around it. The measure allows for similarly sized regions to merge at each stage of the algorithm, as

larger regions are punished more than smaller regions. This measure is easily computed by taking the sum of the sizes of the two regions and dividing this by the size of the entire image in pixels, such that a value in the range of [0, 1] is maintained. The last measure is s_fill, which ensures that gaps are filled and essentially is a measure of how well two segments fit together. By computing the bounding box around the combination of the two regions, its size can be compared to the size of the image, similarly to s_size. All measures have in common that they return a result in the range of [0, 1], which allows them to be combined, and that only the data inside the two segments has to be combined when merging two segments, providing the required speed and low computational complexity. The resulting similarity score is computed by:

$$s(r_i, r_j) = a_1 s_{color}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j), \qquad (2.2)$$

where a_i ∈ {0, 1} is a Boolean value that determines whether a similarity measure is used or not. So far, the related work of the first contribution has been discussed. The second contribution consists of using the additional depth channel to improve scene category classification. The general pipeline of this work can be found in Figure 2.4. As can be seen from the pipeline, the results of the Selective Search could be used by an object detector to localize objects in a scene. For this thesis, we will use the spatial relations between objects for the scene classification to generate a novel depth-based feature, yet implementing an object detector lies outside the scope of this work. Hence, instead of actually using the generated object proposals in an object detector, the ground truth values of the scene are used. Instead of focusing on object detection algorithms, our main focus will be on scene classification algorithms. In the following section, a number of methods for object detection and scene classification are discussed. 2.4 Scene classification In the previous sections, the various approaches to generate object proposals were discussed, of which the resulting object proposals can be used for scene classification. In this section, the focus will be on the second contribution of this work, namely the scene classification task.

Figure 2.4: Using an RGB-D image pair, our system segments the image in 3D space using the VCCS algorithm [44], as discussed in Section 2.2.1. Afterwards, this segmentation is used to generate object proposals, which are used to generate the context feature. This feature is used in combination with a depth CNN feature and an RGB CNN feature to train an SVM classifier for the task of scene classification. Until recently, scene classification was mostly performed using methods that work with RGB images. In the past, the general approach of state-of-the-art methods [1, 2] was to use hand-crafted features such as HOG [3] and SIFT [4] features in order to classify objects within the image. These methods generate features based on distinct locations in an image, for instance locations that contain corners or edges. These features are used to generate a descriptor for the entire image, by which the image can be compared to other images. One of the main benefits of these approaches over simply comparing all the pixels between two images is that these features have additional properties such as being scale invariant, making them more robust. The usage of these features changed with the introduction of the Bag of Words (BoW) approach to computer vision tasks [47–49]. This is a method from the field of natural language processing that works by representing a document as a vector that is filled with the occurrence counts of the words inside. This method was adapted to the field of computer vision by treating local features inside the image as words, which in turn represents an image by the occurrence counts of its local features. The benefit of this method over just using the features as an image-wide descriptor is that local features can now be used to classify an image on different levels. One of the downsides of the BoW technique is that any spatial information regarding the features is lost, as only the occurrence counts are stored. A potential solution to this loss of spatial information was the introduction of spatial

pyramids [50]. This method attempts to solve the loss of spatial information by dividing the image into several regions in a hierarchical fashion, where at the bottom you will find many small regions that are gradually concatenated into a region that covers the entire image. By computing a BoW for each of these separate regions and computing a weighted combination of these regions, the spatial information is maintained while also being able to benefit from the features of BoW techniques. An adaptation of the spatial pyramids has also been published recently [51]. Their approach uses RGB-D images in combination with spatial pyramids and two depth descriptors, called NARF [52] and PFHRGB [53], to classify scene categories. Today, the main approaches towards tackling the scene classification task make less use of such features, and appear to be oriented towards the usage of Convolutional Neural Networks (CNNs). As this work will also use a CNN, the methods involved will be discussed in greater detail below. 2.5 Convolutional Neural Networks Recently, a different approach to scene classification is being taken, namely the usage of Convolutional Neural Networks (CNNs), as can be seen in challenges such as the ImageNet challenge [54]. CNNs have become highly involved in training scene classifiers, and find their origin in Neural Networks. In 2012, a deep convolutional neural network was submitted to the ImageNet challenge [55], which won that year's challenge and largely influenced the entries of later years. CNNs are based on the concept of feed-forward Neural Networks, where we define a neural network to be a system built from layers that each have a certain number of neurons. Each of these neurons can be connected with neurons from the next layer, and each connection has a specific weight. In general, a neural network always has an input layer and an output layer, with optional hidden layers in between. The feed-forward neural network is a type of neural network where the links between the different layers only point from the input layer, through the hidden layers, towards the output layer, which results in all information going just one way. A simple illustration of such a network can be found in Figure 2.5. Because layers in neural networks are fully connected, image data can quickly give an explosive growth in the number of weights that have to be learned. Take, for instance, a 64x64 RGB image: each node in the first hidden layer will have at least 64 x 64 x 3 = 12,288 weights, making standard neural networks unsuited for the task of scene classification.
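
The back-of-the-envelope computation below makes this concrete and contrasts it with a convolutional layer, whose filters are shared across all spatial positions. It is an added illustration, not part of the thesis; the hidden layer size of 256 and the 5x5 kernel are arbitrary assumed values.

```python
# Back-of-the-envelope comparison (illustration only): weights needed for a 64x64
# RGB input when the first layer is fully connected versus convolutional.
height, width, channels = 64, 64, 3
hidden_nodes = 256                      # assumed size of the first hidden layer

# Fully connected: every hidden node connects to every input value.
fc_weights = height * width * channels * hidden_nodes
print(fc_weights)                       # 3,145,728 (plus biases)

# Convolutional: 256 filters of size 5x5 spanning all 3 input channels;
# the same weights are reused at every spatial position of the image.
filters, kernel = 256, 5
conv_weights = kernel * kernel * channels * filters
print(conv_weights)                     # 19,200 (plus biases)
```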

Figure 2.5: Illustration of a feed-forward neural network, with 3 input nodes, 2 hidden layers of 3 nodes each, and an output layer with 3 nodes. CNNs differ from standard neural networks because CNNs have layers that consist of convolutions. Convolutions are essentially filters that can be used to detect local features such as edges and corners. A CNN can safely apply convolutions to the input data because the explicit assumption of CNNs in the field of computer vision is that the input will consist of images. A standard neural network is not specifically designed to accept image data, and one cannot assume that convolutions can always be applied to the input. As a result, a CNN can specifically be designed to efficiently process image data, and different types of layers can be used, each having a specific task. The general observation with CNNs is that each consecutive layer will trigger on more specific elements in an image. For instance, the first layer will find differences in the gradient. The second layer might find edges and corners, and further layers would trigger on the general shape of objects such as a car or a chair. An illustration of these findings can be generated by visualizing the individual layers in a network [56]. A small example of such visualizations can be found in Figure 2.6. Figure 2.6: Visualization of the first two layers of a CNN, taken from the work of Zeiler and Fergus [56]. In general, the filters of the layers become more complex as more layers get added. A typical CNN is built up from a set of different types of layers, each with a specific task. The remainder of this section will describe some of the categories of layers used in

previous work [6, 8] and describe the CNN used for this work. 2.5.1 Convolutional layer The convolutional layers consist of a set of learnable filters, and they form the core building block of CNNs. Each filter has weights that can be learned, which form the parameters of the layer. Each filter in this layer only covers a small spatial area of the image, also called the receptive field, but will cover all channels of an image. For RGB images these channels are usually represented by the three color channels. During the forward pass, the filter is convolved across the width and height of the image, and the dot product is computed between each filter and the input at all the visited positions. This process produces a 2D activation map that represents the activation of each filter across the image. The goal of the learning process for the parameters of a convolutional layer is to have each filter correspond to a specific type of feature in the image. A convolutional layer typically has four hyperparameters. First, we have the receptive field size, which controls the size of the filter. Second, the depth of the layer has to be determined, which is equal to the number of filters that will be used in the layer. Third, the stride by which the filter is moved across the image. Last, the amount of zero padding, which allows one to control the output size. 2.5.2 Pooling layer Each convolutional layer used in the network will add more parameters that will have to be learned. In order to reduce the number of parameters, which in turn reduces the computational costs of training the network, a pooling layer can be added. This layer reduces the spatial size of the representation by taking the maximum operation over a small region that is moved over the representation. For instance, if one takes the maximum over a 2x2 region, the size of that region is effectively reduced by 75%. An illustration of the effects of a max pooling layer is shown in Figure 2.7. Figure 2.7: Illustration of a pooling layer in a CNN, with a 2x2 filter and a stride of 2.
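
A minimal NumPy sketch of this 2x2, stride-2 max pooling operation is given below; it is an illustration added for clarity, not code used in the thesis.

```python
# Minimal NumPy sketch of 2x2 max pooling with stride 2, as illustrated in
# Figure 2.7 (illustration only).
import numpy as np

def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    assert h % 2 == 0 and w % 2 == 0, "assume even spatial dimensions"
    # Group the map into non-overlapping 2x2 blocks and take the maximum per block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]])
print(max_pool_2x2(x))
# [[6 5]
#  [8 9]]   -> the 4x4 map shrinks to 2x2, a 75% reduction in elements
```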

2.5.3 Fully connected layer Typically, a CNN ends with fully connected layers because, similar to a normal neural network, these can easily be used to give scores to the different classes in a classification task. The main difference between a fully connected layer and a convolutional layer is that the convolutional layer is only connected to a local area of the input and that many parameters of a convolutional layer are shared, while the parameters in a fully connected layer are not. 2.5.4 GoogleNet For this work, we will use the CNN that was used to win the ImageNet Challenge 2014, namely GoogleNet [8]. This network is a deep CNN that was used in the object detection challenge and has been trained to classify different object classes. One of the key differences with other approaches is that the layers are not just applied linearly but also in parallel. A full description of this network lies outside the scope of this work, and for that we refer to the original paper. The general pipeline of GoogleNet can be seen in Figure 2.8. Figure 2.8: Representation of GoogleNet, as provided by the authors [8]. 2.6 Features using context While a CNN is able to capture specific patterns, these are still only local patterns. What we would want for scene category classification is a method that does not just extract features from local information, but also from global information. This global information has also been referred to as contextual information in previous work. Contextual

information is interpreted as any available information in the image that can influence the perception of the scene and the objects it contains [57]. The works of Divvala et al. [58] and Galleguillos and Belongie [59] each present an overview of the different types of contextual information. These overviews also include contextual information that can be gathered from outside an image, such as cultural and geographic context. In this thesis, we will only consider the context that can be provided by the image itself. Contextual information that can be derived from data available inside an image is, for instance, 2D scene gist context [60, 61], 3D geometric context [62, 63], and semantic context [19, 21, 64]. Torralba [60] showed that a global feature, which he called a gist, could predict the presence of an object and its location. In the work of Gupta et al. [62], they show that the localization of objects can be improved by predicting the location and size of the support surfaces in an image. Rabinovich et al. [19] demonstrated that certain objects are more likely to co-occur in an image. A somewhat similar approach of Farhadi and Sadeghi [21] proposed to model objects in common conjunctions, such as a person riding a horse, instead of identifying the objects person and horse individually. Hence, taking the research regarding the effectiveness of contextual information into account, we can conclude that the addition of this contextual information is a useful contribution to a scene classification pipeline. The approaches of Rabinovich et al. [19] and Farhadi and Sadeghi [21] are both heavily influenced by the performance of the underlying object detection. If the object detection does not capture all objects in the image, the quality of their methods also deteriorates. By adding the two additional depth-based measures to Selective Search, we hope to improve the object proposal generation, which in turn could be used to improve an object detection algorithm. The proposed new feature of this work uses the co-occurrence between the objects in the scene, like the method of [19], but instead of just keeping count of the co-occurrences in a scene, the 3D spatial relations are also considered. These 3D spatial relations are derived using the additional depth information of the RGB-D data. In the following chapter, we discuss the new measures for Selective Search and present an application of the improved object proposals in the scene classification task.
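
As a point of reference, the toy snippet below (an added illustration, not the feature of this thesis) shows the plain object co-occurrence counting used in semantic-context work such as Rabinovich et al. [19]; the feature introduced in Chapter 3 extends this idea by also encoding the 3D distances between the co-occurring objects.

```python
# Toy sketch of plain object co-occurrence counting, as used in semantic-context
# work such as Rabinovich et al. [19] (illustration only).
from itertools import combinations
from collections import Counter

def cooccurrence_counts(images):
    """images: list of per-image object label lists (hypothetical example data)."""
    counts = Counter()
    for labels in images:
        # Count each unordered pair of distinct classes once per image.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts

scenes = [["toilet", "sink", "towel"],
          ["desk", "chair", "monitor"],
          ["sink", "towel", "mirror"]]
print(cooccurrence_counts(scenes)[("sink", "towel")])  # 2
```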

Chapter 3 Methodology As discussed in the introduction, the contributions of this thesis are two-fold. The first contribution attempts to improve an object proposal method called Selective Search, by adding depth-based features. The second contribution improves the scene recognition task by using newly created depth features in conjunction with an already existing classifier. 3.1 Extending Selective Search In an attempt to improve the Selective Search algorithm [32], we explore whether replacing the RGB-based graph-based oversegmentation [38], which is used to generate the initial regions, with a 3D-based segmentation method would improve the results of the Selective Search method. The graph-based segmentation generates superpixels, which essentially are clusters of pixels that are similar according to a certain measure. In this thesis, we want to use the 3D equivalent of superpixels, namely supervoxels. As the replacement for this graph-based segmentation, the VCCS algorithm is taken. By using this algorithm, it is now also possible to use the depth channel of the RGB-D data, instead of just RGB. Because Selective Search can only use a 2D representation for the labeling of the images, the resulting point cloud of the VCCS algorithm is projected back into the original view. Because of this projection, some artifacts, such as small clusters of pixels of a label floating in a different segment, are left in the image. In order to remove these artifacts, a median blur is used to smooth them out. This results in a cleaner segmentation, and an example of this median blur is displayed in Figure 3.1. By using the additional depth information, two novel similarity measures that are based on depth features are implemented. These additional measures are combined with the four already existing similarity measures of Selective Search that are described in Section 2.3.

(a) Unsmoothed segmentation (b) Smoothed segmentation Figure 3.1: Example of an initial segmentation showing the difference between an unsmoothed and a smoothed image, respectively. Most of the specks that can be found in the left image have been smoothed out in the right image using a median blur. As a result, the algorithm now has the four standard measures, S_color, S_texture, S_size, and S_fill, and the two new measures, S_voxel and S_distance. These two new measures can either be weighted separately, or grouped by color and depth features. Next, the two new features are described in more detail. S_distance(r_i, r_j) measures the Euclidean distance between the centers of two segments r_i and r_j in 3D space. The measure is represented as the fraction of the maximum possible distance within the scene, resulting in the score staying within a range of [0, 1], which is in line with the already existing measures. The maximum distance within an image is computed by creating a 3D bounding box around the point cloud of the image and taking the Euclidean norm (also known as the 2-norm) of the difference between the two diagonally opposite points. This value is inverted by subtracting the result from 1, such that a small distance gives a high score, while a large distance gives a low score. For the (x, y, z) locations c_i and c_j that represent the centers of the supervoxels r_i and r_j respectively, this gives:

$$S_{distance}(r_i, r_j) = 1 - \frac{\lVert c_i - c_j \rVert_2}{\max(S_{distance})} \qquad (3.1)$$

$$= 1 - \frac{\left[\sum_{m} (c_{i,m} - c_{j,m})^2\right]^{1/2}}{\max(S_{distance})}, \qquad (3.2)$$

where the Euclidean distance is divided by the maximum distance to return a ratio within the range of [0, 1]. S_voxel(r_i, r_j) is a simpler measure that uses the adjacency matrix retrieved from the VCCS algorithm to either return a 1 if an adjacency link is present between supervoxels r_i and r_j in the point cloud representation, or a 0 if this adjacency link is missing.

For two segments r_i and r_j, this results in:

$$S_{voxel}(r_i, r_j) = \begin{cases} 1 & \text{if a voxel link between } r_i \text{ and } r_j \text{ exists} \\ 0 & \text{otherwise} \end{cases} \qquad (3.3)$$

With these new features, there are now two possible ways to combine the new depth-based measures with the original color-based measures. The first approach is to just weigh both the color measures and the depth measures equally. If the user chooses to use a measure, it will have a weight of 1, and 0 otherwise. This results in a simple extension of Equation 2.2:

$$s(r_i, r_j) = a_1 s_{color}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j) + a_5 s_{distance}(r_i, r_j) + a_6 s_{voxel}(r_i, r_j), \qquad (3.4)$$

where a_i ∈ {0, 1} describes whether the similarity measure is to be used or not. A potential issue with equally weighing the measures is that the original color-based measures of Selective Search will always have a majority over the depth measures, because there are simply more of them, resulting in a potentially biased final measure. A simple solution is to weigh the combination of color and depth features separately and combine the results, which gives:

$$s(r_i, r_j) = \frac{a_1 s_{color}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j)}{a_1 + a_2 + a_3 + a_4} + \frac{a_5 s_{distance}(r_i, r_j) + a_6 s_{voxel}(r_i, r_j)}{a_5 + a_6}. \qquad (3.5)$$

The results of altering the equation to weigh color features and depth features equally are presented in Section 5. If we look at the pipeline in Figure 2.4, we can see that the next step in the pipeline is to use the generated object proposals for the object detection task. In this thesis, the object detection step is not performed. Instead, the ground truth values for the objects in the image are used. This will allow the novel depth-based feature of the second contribution of this work to use the information of the objects in the image without the additional noise that is introduced by object detectors.
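
To make the grouped weighting of Equation 3.5 concrete, the sketch below shows one possible way to compute it. This is an added illustration under stated assumptions, not the thesis implementation: each region is assumed to carry a precomputed normalized color histogram, its 3D centre as a NumPy array, its id, and the set of adjacent supervoxel ids, and max_dist is assumed to be the diagonal of the 3D bounding box around the scene's point cloud. Only s_color is shown for the color group; s_texture, s_size, and s_fill would be added to that group in the same way.

```python
# Sketch of the grouped color/depth similarity of Equation 3.5 (illustration only).
import numpy as np

def s_color(r_i, r_j):
    # Histogram intersection of the normalized color histograms (Section 2.3).
    return np.minimum(r_i["hist"], r_j["hist"]).sum()

def s_distance(r_i, r_j, max_dist):
    # Equations 3.1/3.2: one minus the centre distance as a fraction of max_dist.
    return 1.0 - np.linalg.norm(r_i["centre"] - r_j["centre"]) / max_dist

def s_voxel(r_i, r_j):
    # Equation 3.3: 1 if the two supervoxels are linked in the adjacency graph.
    return 1.0 if r_j["id"] in r_i["adjacent"] else 0.0

def similarity(r_i, r_j, max_dist, a_color=(1.0,), a_depth=(1.0, 1.0)):
    # Equation 3.5, restricted to the measures above. Each group is normalized by
    # its own weights, so the more numerous color measures cannot outvote the two
    # depth measures.
    color = [s_color(r_i, r_j)]
    depth = [s_distance(r_i, r_j, max_dist), s_voxel(r_i, r_j)]
    color_score = sum(w * s for w, s in zip(a_color, color)) / sum(a_color)
    depth_score = sum(w * s for w, s in zip(a_depth, depth)) / sum(a_depth)
    return color_score + depth_score
```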

In the following section, we elaborate on how the novel depth features are used in combination with the CNNs for the task of scene classification. 3.2 CNN with additional depth features In this thesis, we explore whether there are other features that could be extracted from the image that make use of the additional depth channel. We have already seen the HHA features [7], which use features generated on a superpixel level. These features are still mostly derived from the two-dimensional representation, making them 2.5D features, as explained by the authors [7]. For this thesis, we implement a feature that uses the locations of the detected objects within the scene, in 3D space. For this feature, a strong assumption is made, namely that there is a good, or preferably perfect, object detector to detect all the objects in an image. The feature is structured as a 3D co-occurrence matrix over all objects that will be compared. In this thesis, we use all objects that are present in at least 50 unique images in the dataset. This results in a total of 209 object classes remaining in this dataset. For all the objects in an image, the 3D location is computed using the depth data. Next, the distances between all objects, other than with themselves, are computed for each image. To avoid having to measure on a continuous scale, the distance between objects is discretized into a histogram of ten bins; the first nine bins have a size of 50 centimeters each, covering distances up to 4.5 meters. The last bin contains all distances greater than 4.5 meters; this particular value is selected because the inaccuracy of most sensors grows when nearing, or going past, this distance. The result is a 3D matrix with the objects on the x and y axes, and the histogram with 10 bins on the z axis. If computed naively, this results in n^2 entries for n objects per image. However, the distance between two objects is equal regardless of whether you compare object o_i with o_j or vice versa. Hence, only one triangular half of the co-occurrence matrix, including the diagonal, is used, which reduces the number of entries to n(n+1)/2. The last step of generating the feature is to flatten it, because many implementations of neural networks do not accept 3D features. The matrix is flattened per row, where all histograms on a row are concatenated, to be combined with the other rows later; this results in a (n(n+1)/2 x #bins)-sized vector. Similar to the work of [7], the resulting features are used to train a one-vs-all SVM classifier [65]. By design, the SVM is fundamentally a two-class classifier, yet the problem of classifying K > 2 different classes occurs often.
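
The sketch below shows one way the flattened distance-histogram feature described above could be built; it is an added illustration, not the thesis code, and it assumes that every detected object is given as a class index together with its 3D centre in metres.

```python
# Compact sketch of the object-object distance feature described above
# (illustration only): distances between the 3D centres of the detected objects
# are discretized into ten 0.5 m bins (the last bin catches everything beyond
# 4.5 m), stored in one triangular half of a class-by-class matrix, and flattened.
import numpy as np
from itertools import combinations

N_CLASSES, N_BINS, BIN_SIZE = 209, 10, 0.5   # 209 object classes kept in the thesis

def distance_feature(objects):
    """objects: list of (class_index, centre) with centre a 3D point in metres."""
    hist = np.zeros((N_CLASSES, N_CLASSES, N_BINS))
    for (ci, pi), (cj, pj) in combinations(objects, 2):
        d = np.linalg.norm(np.asarray(pi) - np.asarray(pj))
        b = min(int(d / BIN_SIZE), N_BINS - 1)        # last bin: d >= 4.5 m
        lo, hi = min(ci, cj), max(ci, cj)             # order-independent pair
        hist[lo, hi, b] += 1
    # Keep only the upper triangle (including the diagonal) and flatten row by row
    # into a (n(n+1)/2 * N_BINS)-dimensional vector.
    rows, cols = np.triu_indices(N_CLASSES)
    return hist[rows, cols].reshape(-1)

feat = distance_feature([(3, (0.5, 0.1, 2.0)), (17, (1.0, 0.2, 3.5))])
print(feat.shape)   # (219450,)  ==  209*210/2 * 10
```

The flattened vectors can then be fed to an off-the-shelf one-vs-all linear SVM implementation, for example scikit-learn's LinearSVC, which is used here only as a stand-in for the classifier of [65].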

A common solution to this problem is to implement K different SVM classifiers, where each classifier SVM_k classifies the k-th class as positive, and all entries that belong to other classes as negative. The final output of the combination of the SVM classifiers is the class belonging to the SVM that produces the highest score for a given entry. Before the RGB and depth CNN features can be used optimally for the task of scene category classification on the SUN RGB-D dataset, they have to be fine-tuned towards this specific task and dataset. This has to happen before using them in the SVM classifier, and the steps involved in this process are discussed below. 3.2.1 Fine-tuning CNN Most recent research has been oriented towards object classification, while only a small portion is dedicated to specifically classifying the scene category of an image. Hence, when using a CNN that has been trained to classify a certain set of objects, it will have to be fine-tuned towards classifying a set of scene types. A benefit of using an already existing CNN, although trained towards different, but similar, data is that a properly working model can be achieved with a relatively small amount of data. When working with deep CNNs such as GoogleNet, a dataset like SUN RGB-D with around 10,000 RGB-D images is usually not considered to be enough to train the weights of a CNN from scratch. Recent work has shown that it is more effective to train a CNN using the weights of an already existing CNN whose weights have been optimized for a closely related goal, over training from scratch with little data [7, 66]. For instance, in the work of Xia [66], AlexNet is used to classify styles in an image, while the network was originally trained to classify objects. Gupta et al. [7] showed that a CNN model trained on RGB data can be fine-tuned to create features based on depth data, which they called HHA features. In order to fine-tune the GoogleNet model to classify scene categories instead of objects, the last few layers of the model have to be adjusted. Taking a closer look at the last layers of the pipeline displayed in Figure 2.8, we see that the final layers consist of a fully connected layer (represented as a convolution layer in the image) and a softmax classification layer. GoogleNet was originally trained to detect 1000 different object categories, while the network will be used to detect 21 different scene categories in this work. Hence, the number of outputs of the final layer is changed from 1000 to 21, and the weights of the loss functions are removed, such that these can be retrained from scratch. This allows the model to adjust its parameters to have it classify the 21 scene classes instead of the 1000 object classes. After the CNN is fine-tuned, the novel depth features can be added to the pipeline.
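
The snippet below sketches this head-replacement step in PyTorch/torchvision with a ResNet-18 stand-in backbone. It is an added illustration of the general recipe only: the thesis fine-tunes GoogleNet in its own framework, and the choice of backbone, the layer names, and the learning rates here are assumptions.

```python
# Sketch of the head-replacement step for fine-tuning (illustration only, with a
# ResNet-18 stand-in backbone): keep the pretrained weights, replace the final
# fully connected layer so it outputs 21 scene classes, and retrain the new head.
import torch
import torch.nn as nn
from torchvision import models

NUM_SCENE_CLASSES = 21

model = models.resnet18(pretrained=True)                       # weights trained on ImageNet objects
model.fc = nn.Linear(model.fc.in_features, NUM_SCENE_CLASSES)  # new, randomly initialized head

head_params = list(model.fc.parameters())
body_params = [p for name, p in model.named_parameters() if not name.startswith("fc.")]
optimizer = torch.optim.SGD(
    [{"params": body_params, "lr": 1e-4},   # small updates for the pretrained body
     {"params": head_params, "lr": 1e-2}],  # larger updates for the new classifier
    momentum=0.9)
```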

Chapter 4 Experiments

In this section, we first discuss the specifics of the dataset and the implementation details in Section 4.1 and Section 4.3 respectively. Next, the experiments for the extension of the Selective Search algorithm are discussed in Section 4.4. Finally, the experiments for the scene category classification task are described in Section 4.5.

4.1 Dataset

In this thesis, the SUN RGB-D V1 dataset is used [13]. The dataset contains RGB-D images from three other datasets, namely NYU depth v2 [12], Berkeley B3DO [67], and SUN3D [68]. In total, the SUN RGB-D V1 dataset contains RGB-D data generated by four different sensors. Figure 4.1 displays the quality differences per sensor, and Table 4.1 shows the distribution of the images over these sensors. The dataset is composed of four folders, one for each sensor. The entries that contain depth data are used to create the test and train sets used in this thesis.

Table 4.1: The sensors used to generate the SUN RGB-D dataset [13] (Intel RealSense, Asus Xtion, Kinect v1, and Kinect v2), and the number of images per sensor. First column: images remaining after filtering out missing depth information. Middle column: images remaining after also removing the noisy classes, such as idk and furniture store. Last column: the number of images that remain after applying a minimum threshold on the scene class occurrences.

Figure 4.1: Comparison of the four RGB-D sensors. The raw depth map from the Intel RealSense is noisier and has more missing values. The Asus Xtion and Kinect v1 depth maps show an observable quantization effect. The Kinect v2 is more accurate in measuring details in depth, but it is more sensitive to reflections and dark colors. [13]

4.1.1 Properties

Each entry contains various properties, as displayed in Table 4.2. The table does not display all available properties in the dataset, but only the ones used in this work. For instance, the extrinsics are not used because the tasks at hand do not require combining multiple images or point clouds. The 3D annotations are also ignored, because they are incomplete and only consist of bounding boxes for some of the objects in the room. Instead, the 3D annotations are re-computed using the 2D annotations in combination with the point cloud that is generated from the depth information and the intrinsic parameters of each image.

Property       Description
Image          3-channel 8-bit .jpg RGB image
Depth          1-channel 16-bit .png image, each pixel represents the measured depth
Depth bfx      1-channel 16-bit .png smoothed depth image, to cover missing values
Intrinsics     .txt file with the intrinsics matrix of the used sensor
Scene          .txt file containing the ground truth class of the scene
Annotation2D   .json file containing the ground truth of the objects in the scene

Table 4.2: Properties of an entry in the dataset that are used in this thesis.

4.2 Data pre-processing and Analysis

In order to use the dataset for scene type classification with supervised techniques, one first has to determine the number of unique scene types in the dataset. The first analysis shows two odd classes that cannot be used in this thesis: idk and

furniture store. The entries labelled with the idk class are removed because this label is an abbreviation for "I do not know"; hence these entries are seen as data without a correct ground truth. Looking at some examples of this class in Figure 4.2, it is evident that these images lack any context from which a room category could be derived.

Figure 4.2: Samples of the removed idk class.

The furniture store entries are removed because the contents of these images are showroom models of various scene types, which the classifier would mistake for the actual shown scene type. Hence, these entries are treated as noise and removed. Some examples of the furniture store class are shown in Figure 4.3, where the scenes could easily be perceived as either a bedroom or a kitchen.

Figure 4.3: Samples of the removed furniture store class, where the images have the appearance of either the bedroom class or the kitchen class.

The class analysis shows that there are 45 unique classes present in the dataset once the idk and furniture store classes are removed. The classes are, however, not equally represented within the dataset. In order to ensure that all classes are represented in both the train and test set, a minimum threshold of 50 total occurrences is used. This results in a dataset of 21 unique classes.
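As an illustration of this filtering step, a minimal Python sketch is given below; the entry format and variable names are assumptions for illustration rather than the actual pre-processing code.

from collections import Counter

NOISE_CLASSES = {"idk", "furniture_store"}
MIN_OCCURRENCES = 50

def filter_scene_classes(entries):
    # entries: list of (image_path, scene_label) pairs read from the per-image Scene files.
    counts = Counter(label for _, label in entries if label not in NOISE_CLASSES)
    kept = {c for c, n in counts.items() if n >= MIN_OCCURRENCES}
    return [(path, label) for path, label in entries if label in kept]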

The distribution of these classes is displayed in Figure 4.4.

Figure 4.4: Scene category distribution of the classes occurring more than 50 times. 21 out of the original 45 scene categories remain.

The train and test sets used in the object proposal task are obtained by randomizing the images and splitting 90% into the train set and 10% into the test set. For the fine-tuning task we do not want to train a biased R-CNN model. As Figure 4.4 clearly shows the skewness in the data, the train set for this specific task is capped at a maximum of 150 occurrences per class, so as to remain in range of the minimum of 50 class occurrences; if a class occurs more than 150 times, the remainder is inserted into the test set. For the scene category classification, we use 3-fold cross-validation to train and test the resulting SVM classifier. Each of the folds is guaranteed to have an equal distribution of classes by dividing the images over the three sets per class. As a final step, the order of the images in the train set is randomized, such that overfitting in the early stages of training the SVM classifier is avoided.
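A rough sketch of such a class-balanced three-fold split is given below, using scikit-learn's StratifiedKFold as a stand-in for the manual per-class division and shuffling described above.

from sklearn.model_selection import StratifiedKFold

def make_folds(paths, labels, seed=0):
    # Preserve the per-class distribution in each fold and shuffle the order of the entries.
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    return [(train_idx, test_idx) for train_idx, test_idx in skf.split(paths, labels)]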

4.3 Implementation

The implementation of the object proposal generation and the scene category classification is written in Python and C++ on a Linux server. For this project the OpenCV, Point Cloud Library (PCL), Selective Search Python library, and Caffe library [69] are used. The OpenCV library is used to apply image processing to the RGB images, and the PCL library is used with the depth data to both generate and perform operations on 3D point clouds. The Caffe library is a framework for deep learning neural networks, and it also includes various pretrained CNNs, such as GoogleNet and AlexNet [6]. Finally, the Selective Search Python library is an unofficial version of the original Matlab implementation.

4.4 Extended Selective Search

Previous work has shown the importance of having a low number of object proposals [33]. Hence, in the first experiment we study potential ways to reduce the number of generated bounding boxes when using the novel depth features. For this task, the scores of all measures are evaluated separately per image. Using these observations, a threshold is set for the depth features.

For the second experiment, the colorspaces are varied for all combinations of color measures with the novel depth measures. For the experiments in this thesis, the RGB and HSV colorspaces are tested. Following the results of a comparison between five different colorspaces for the task of image segmentation [70], the hypothesis is that the HSV colorspace will outperform the RGB colorspace. For the default Selective Search the parameters are kept at their default values, which means using all color measures and setting the parameters of the original oversegmentation method to σ = 0.8 and k = 100, where σ is the amount of smoothing applied to the image and k is the value of the threshold function. All other variants of Selective Search use the VCCS algorithm with a voxel resolution of 0.008, a seed resolution of 0.1, the color importance set to 0.2, the spatial importance set to 0.4, and the normal importance set to 1.0. All tested combinations of measures apply all default color measures and a subset of the novel depth features.

In the third experiment, the possible problem of a biased algorithm is addressed. The original equation is presented in Equation 3.4, in which the color measures outnumber the depth measures. As it is not certain whether this influences the results, we propose Equation 3.5, in which the combination of color measures and the combination of depth measures are weighed equally. This could remove a bias in the Selective Search algorithm, if such a bias is present.

The final experiment explores different Intersection-over-Union (IoU) threshold values, where the threshold determines whether a generated bounding box matches a ground truth box. For this experiment, all bounding boxes generated during the run are stored with their highest overlap score with a ground truth box. This list is later used to generate the results.
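As a sketch of the equal group weighting proposed for the third experiment, and assuming (as in standard Selective Search) that the combined similarity is a sum of individual measures that each lie in [0, 1], the idea behind Equation 3.5 can be written as follows; the exact form used in this thesis is the one given by Equations 3.4 and 3.5 in Chapter 3.

def weighted_similarity(color_scores, depth_scores):
    # color_scores: colour, texture, size and fill similarities for a segment pair;
    # depth_scores: voxel and adjacency similarities for the same pair.
    color = sum(color_scores) / len(color_scores)
    depth = sum(depth_scores) / len(depth_scores)
    return 0.5 * color + 0.5 * depth   # colour and depth groups contribute equally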

4.5 Scene Classification

For the scene classification task the dataset is split into a train and test set. The train set consists of 90% of the data, and the test set contains the remaining 10%. For these tests the scene classes that occur more than 50 times are used, which results in 21 scene categories. An equal class distribution in the train and test set is guaranteed by splitting each of the classes separately.

The scene classification process consists of three steps. First, the CNNs involved in both the RGB and HHA features are fine-tuned on the dataset. Second, the RGB, HHA, and/or the new depth features are computed, based on a hyper-parameter that decides which features will be tested; as a result, different combinations of the features are tested. The last step is to train the SVM using these features and to pass the test set through the trained SVM to compute the results.

For the fine-tuning of the R-CNN on the SUN RGB-D dataset [13] and the task of scene type classification, the dataset is split into a train, validation, and test set. In order to prevent a bias towards the most occurring classes, each scene class is limited to a maximum of 150 and a minimum of 50 entries for the train and validation sets, and the remainder is added to the test set. As a final step before fine-tuning, the order of the entries in the train set is randomized. The fine-tuning of the R-CNN model is performed on the Distributed ASCI Supercomputer 4 (DAS-4) server [71] and is run for the same number of iterations as in the R-CNN paper [7]. For the remainder of this work it can be assumed that all CNNs that are mentioned have been fine-tuned.

The CNNs are trained in a similar way, although their input is different. For the HHA features we use AlexNet [6] instead of GoogleNet, because the original authors only tested its functionality for that specific CNN model and made no claims about whether it would work on any other CNN. In both cases the weights of the last two layers are fine-tuned to model scene categories instead of object classes. GoogleNet and AlexNet each return a fixed-length feature vector, and both are independently normalized using the L2-norm. After this step they are usable for the SVM classifier.
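The feature fusion and SVM training step can be sketched as follows; the variable names and the use of scikit-learn's LinearSVC are illustrative assumptions rather than the exact setup used in the experiments.

import numpy as np
from sklearn.svm import LinearSVC

def l2_normalize(x, eps=1e-12):
    # Normalize each feature vector (row) independently with the L2 norm.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def train_scene_svm(rgb_feats, hha_feats, depth_feats, labels):
    # Concatenate the selected, individually normalized features and train a one-vs-all linear SVM.
    selected = [f for f in (rgb_feats, hha_feats, depth_feats) if f is not None]
    X = np.hstack([l2_normalize(f) for f in selected])
    clf = LinearSVC(multi_class="ovr", C=1.0)
    return clf.fit(X, labels)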

For the novel depth features, the objects that are used to train the model have to occur in at least 50 unique images according to the ground truth values. This threshold is chosen to remove any potential noise generated by a sub-optimal object detector. Another source of noise is the ground truth annotations themselves, unless we assume that they are generated with knowledge that cannot be derived from the image itself. For instance, objects near the borders of the image had annotations that extended beyond the borders of the image. This problem is solved by taking the closest pixel inside the image for every pixel that lies out of bounds.

Finally, we explore the effectiveness of the proposed depth feature by comparing it to another context-based feature. The proposed depth feature essentially extends a co-occurrence matrix of the objects within the scene, and splits the number of co-occurrences between two objects across a histogram of ten bins. Hence, in order to test whether the performance improvement is mostly driven by the object occurrence counts or by the actual split across the ten bins, we compare the proposed depth feature against a plain co-occurrence context feature, which can essentially be seen as the proposed depth feature with a single bin per object-object pair. In the context of the pipeline this feature still requires a properly functioning object detector, and it can serve as another example of an application for the improved object proposal generation.
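For completeness, the co-occurrence baseline can be obtained from the proposed feature by collapsing the ten distance bins into a single count per object pair, as in this small sketch (assuming the 3D histogram from Section 3.2):

import numpy as np

def cooccurrence_baseline(hist_3d):
    # hist_3d: (n_classes, n_classes, n_bins) pairwise distance histogram for one image.
    counts = hist_3d.sum(axis=2)            # collapse the ten distance bins into one count
    iu, ju = np.triu_indices(counts.shape[0])
    return counts[iu, ju]                   # flattened upper triangle, one value per object pair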

Chapter 5 Results

In this section, we first discuss the results of the various experiments involving object proposal generation. Afterwards, the experiments involving scene category classification are discussed in Section 5.2.

5.1 Extended Selective Search

Throughout this section, the abbreviations CTSF-1, CTSF-2, CTSFA, CTSFV, and CTSFVA are used to denote the applied measures in the extended Selective Search, where the color-based measures are described in Section 2.3 and the depth-based measures in Section 3.1. The abbreviations can be decoded as follows: (C)olor (T)exture (S)ize (F)ill (V)oxel (A)djacency, with each letter matching its corresponding measure. CTSF-1 is the default Selective Search algorithm, and CTSF-2 uses the same measures but relies on the VCCS oversegmentation algorithm. Before discussing the results of the experiments, we first elaborate on the evaluation metrics used in this section, namely recall and MABO.

5.1.1 Evaluation Metrics

For the task of object proposal generation, we are mainly concerned with reducing the number of proposals while maintaining, or possibly improving, the recall compared to the default Selective Search approach. The quality of the proposals is also important: preferably we want bounding boxes that are similar to the ground truth and have a high overlap. The metrics used for this task are the recall metric and the Mean Average Best Overlap (MABO) metric, as used in the original Selective Search paper [32].

The recall metric is straightforward. If a generated bounding box has an overlap score above a certain threshold t, it is classed as positive, and otherwise as negative. The overlap score [11] for a generated hypothesis h and a ground truth box g is given by:

$$\text{Overlap}(h, g) = \frac{\text{area}(h \cap g)}{\text{area}(h \cup g)} \qquad (5.1)$$

The recall is computed by iterating over all the ground truth boxes and checking whether there is a hypothesis with an overlap score above the used threshold. It is important to note that not the number of matching hypotheses is measured, but only whether there is at least a single match for a ground truth box. The recall score is given by:

$$\text{Recall} = \frac{\#\text{Correctly classified ground truth boxes}}{\#\text{Total ground truth boxes}} \qquad (5.2)$$

One of the downsides of this method is that, although you know that there are matching hypotheses for the ground truth, you do not know whether they are all barely above the threshold or near perfect matches. The MABO metric attempts to capture the overall quality of the generated proposals, by keeping track of the maximum overlap of each ground truth box in each image and taking the average over these values. The Average Best Overlap (ABO) score for a given object class c, considering all ground truth boxes $g_i^c \in G^c$ of this class and all hypotheses $h_j \in H$, is given by:

$$ABO_c = \frac{1}{|G^c|} \sum_{g_i^c \in G^c} \max_{h_j \in H} \text{Overlap}(g_i^c, h_j), \qquad (5.3)$$

and the MABO score is given by computing the mean of the ABO scores over all classes:

$$MABO = \frac{1}{|G|} \sum_{G^c \in G} ABO_c. \qquad (5.4)$$
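The recall and MABO computations defined above can be sketched as follows, assuming boxes are given as (x1, y1, x2, y2) tuples; this is an illustrative implementation, not the evaluation code used for the experiments.

def iou(h, g):
    # Overlap score of Equation 5.1: intersection area over union area.
    ix1, iy1 = max(h[0], g[0]), max(h[1], g[1])
    ix2, iy2 = min(h[2], g[2]), min(h[3], g[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area(h) + area(g) - inter)

def recall_and_abo(ground_truth, hypotheses, t=0.5):
    # ground_truth, hypotheses: lists of boxes for one class.
    best = [max(iou(h, g) for h in hypotheses) for g in ground_truth]
    recall = sum(b >= t for b in best) / float(len(best))
    abo = sum(best) / float(len(best))   # MABO is the mean of this ABO value over all classes
    return recall, abo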

5.1.2 Reducing segment count

One of the possibilities to improve the results of Selective Search using the additional depth features is to drastically reduce the number of generated segments, while incurring a much smaller change in recall. In order to investigate the effects of the additional features, the similarity scores between two segments for the new and old features are compared on the images in the train set. The findings show that a value above 0.97 for the S distance measure results in poor bounding boxes around the segments. Hence, this value is used as the threshold for this measure, and all object proposals with a score above this threshold are discarded.

Table 5.1 displays the results of a run on the test set. The results show a relative drop of approximately 9% in recall and 8% in MABO, while the number of bounding boxes drops far more significantly, by approximately 65%. The results appear to indicate that, by using the additional depth measures, a similar level of performance can be achieved with significantly fewer object proposals on this dataset. From these results we can derive that a high number of false positive bounding boxes is removed, while only a few true positive boxes are discarded. The decrease in MABO is highly correlated with the decrease in recall, since a ground truth box that is no longer sufficiently (above the threshold) covered by the bounding boxes will also have a lower MABO score. A possible explanation for the effectiveness of this threshold is that a small distance between the centers of two segments in 3D space generally implies that they can be captured by a small bounding box, and in this dataset there are only a few objects of such a small size. The decrease in recall and MABO is most likely the result of these small objects in a scene no longer being sufficiently overlapped by a generated bounding box.

Table 5.1: Results of extended Selective Search (Precision, Recall, MABO, and average box count) using the RGB and HSV color spaces, with an IoU threshold of 0.5, for the CTSF-1, CTSF-2, CTSFA, CTSFV, and CTSFVA measure combinations and the weighted HSV variants. The best scores per category are bolded. For the Precision, Recall, and MABO scores higher is better, while the Avg. Box count is better when lower. The weighted HSV CTSFVA mask shows the best results for Recall and MABO, while the weighted HSV CTSFV mask has the highest scores on all three categories relative to the number of generated boxes.

5.1.3 Varying Colorspaces

In order to test the effects of using different colorspaces, two frequently used colorspaces are tested, namely RGB and HSV. The results of both attempts are shown in Table 5.1.

40 Results When comparing the two colorspaces, it becomes evident that HSV outperforms the RGB colorspace only just slightly for most combinations. The original Selective Search (CTSF-1 ) is the exception to these findings. This is a striking result, because the original paper also uses the HSV colorspace over the RGB colorspace, while using the same parameters. Previous research has also shown that the HSV colorspace should outperforms the RGB colorspace for the task of image segmentation [70]. An explanation for this significant difference in performance between HSV CTSF-1 and the other HSV masks could be the different types of scenes used in this thesis and the original paper. In the work of Uijlings et al. [32] they use the Pascal VOC 2007 test set [72], which includes both indoor and outdoor scenes, while in this work we only use indoor scenes. Outdoor scenes differ from indoor scenes. For instance, objects in outdoor scenes are often spaced further apart, simply because there is more room available. There are also different object categories such as trees, birds, cows, and bikes that are often not found indoors. The SUN RGB-D dataset uses different object classes and possibly the sizes and the characteristics differ too much when compared to the objects in the Pascal 2007 test set. One of the possible explanations for the performance increase for the other masks, when using the HSV colorspace, is that HSV is more effective for the used measures in this work. Because, the depth measures are not influenced by a change in colorspace, this would leave the color-based measures. A potential cause for the increase in performance for the HSV colorspace can be found in the differences between RGB and HSV. One of the major differences is that the HSV colorspaces splits the image intensity from the color information. For instance, having a shadow on a plane of red, would result in a different distribution of red, green, and blue across the plane in a RGB colorspace, while the HSV colorspace would have a similar Hue and Saturation value, but a different color intensity value. Because the color measure computes the difference between color histograms, where each channel is treated separately, the histograms would appear more similar if 2 channels remained the same, instead of all channels changing, as would be the case with RGB Weighing depth features Extended Selective Search has four color measures and two depth measures, which could potentially leave the algorithm biased towards favouring color measures. Hence, the algorithm is adapted to weigh the combination of color measures and the combination of depth measures equally to see if this even balance provides any significant difference when compared the original balance. The results are presented in Table 5.1.

From the results it is clear that the adjacency measure is not precise enough to be used in the weighted version. If we compare HSV CTSFA to weighted HSV CTSFA, we see that they show similar performance, and when we compare HSV CTSFVA to weighted HSV CTSFVA, we see that the precision decreases while the recall, MABO, and number of boxes increase. With respect to the original goal of improving the recall and MABO scores of the CTSF-2 approaches, the weighted HSV CTSFVA scores best on both recall and MABO out of all tested approaches, while generating slightly fewer bounding boxes than the color-based CTSF-2 approaches. When comparing HSV CTSFV to weighted HSV CTSFV, we see that the performance of the latter drops slightly, while also having a minor reduction in the number of bounding boxes. Although the performance of weighted HSV CTSFV on all three metrics is lower than that of HSV CTSFV on an absolute scale, the weighted HSV CTSFV performs relatively best out of all tested methods when the scores are compared to the number of generated object proposals. Because one of the goals of this work is to use the additional depth data to reduce the number of object proposals while maintaining similar performance, we conclude that the weighted HSV CTSFV is the best option for this goal.

5.1.5 Varying IoU overlap threshold

As discussed in Section 5.1.1, the definition of a positive match for the recall metric is linked to a threshold value that is compared against the overlap of a generated proposal with the ground truth. In order to test the influence of this parameter, the threshold is varied over the range of 0% to 100% overlap with a step size of 1%. The results are displayed in Figure 5.1.

The first thing that can be derived from the graph is that the CTSF-1 performance quickly drops compared to the other masks. For the other masks, altering the IoU threshold has the most effect on the masks that use the voxel measure in the range between 0.1 and 0.3, where we witness an approximately 10% difference with CTSF-2 and CTSFA, while the difference beyond that point is below 5% and eventually converges. A reason for this difference at the lower threshold values could be attributed to the higher number of generated object proposals for the CTSF-2 and CTSFA masks compared to the CTSFV and CTSFVA masks. Table 5.1 shows that CTSF-2 and CTSFA use significantly more boxes, while the difference between the recall scores is less significant. An explanation for the difference in the number of generated boxes and the relatively small change in the recall score is that the additional object proposals

either are not precise enough and drop below the 0.5 overlap threshold, or match ground truth boxes that are already covered by other proposals, which does not increase the recall score. At the other end of the threshold range, we can see that all methods apart from CTSF-1 appear to converge at an IoU threshold of approximately 0.8, which could indicate that these measures share the same top-ranking bounding boxes.

Figure 5.1: The IoU overlap threshold plotted against the recall, using the results of the HSV colorspace. The CTSF-2 and CTSFA masks almost completely overlap each other; a similar pattern can be seen with the CTSFV and CTSFVA masks.

When we compare CTSF-1 to CTSF-2, we see that changing the initial oversegmentation method realises an increase in recall, although Table 5.1 shows that CTSF-2 generates significantly more bounding boxes. If we compare CTSF-1 to the masks that use the depth measures, we see that the performance below the 0.4 threshold is worse for those methods, but that CTSF-1 quickly performs significantly worse as the threshold is increased. This indicates that although the methods with the depth features and CTSF-1 generate a similar number of object proposals, the quality of the proposals of the masks using the depth measures is higher.

5.2 Scene Classification

For the scene classification task, we combine and compare the new context-based feature, which is generated from the spatial relations between the objects in the scene, with two state-of-the-art CNNs. For this comparison, we use the mean Average Precision (mAP) score, which is discussed first, before analysing the results of the classification task.

5.2.1 Evaluation Metrics

In order to evaluate the results, it is not sufficient to simply measure the number of correctly classified images relative to the total. This measure provides little information, apart from showing that a method labelled a number of images correctly. A scenario where this measure would not be sufficient is a dataset where, for instance, 80% of the data belongs to a single class, while all other classes are contained within the remaining 20%. Accuracy, as computed by

$$\text{Accuracy} = \frac{\#\text{Correctly classified images}}{\#\text{Total images}}, \qquad (5.5)$$

could achieve a high result by simply returning the most occurring class. This is undesirable; we want the method to correctly classify all classes instead of ignoring the least occurring ones. A common approach to measure the performance of a classification method across all classes is to compute the mean Average Precision (mAP). This involves computing the Average Precision (AP) [73] for every class and taking the average over these AP values. One can compute the AP by calculating the precision and recall at every position in the ranked output of a method, plotting the precision-recall curve, where precision p(r) is plotted as a function of recall r, and computing the area under this curve. Intuitively, the AP metric favours correctly classified items that occur early in the ranked output, and for a given class c it is defined as:

$$AP_c = \frac{\sum_{k=1}^{n} P(k)\,\mathrm{rel}(k)}{\#\text{relevant documents}}, \qquad (5.6)$$

where n is the total number of images in the test set, P(k) is the precision at rank k, and rel(k) is a boolean indicator of whether the document at rank k is of class c. The mAP score for a classifier is then computed by:

$$mAP = \frac{1}{C} \sum_{c=1}^{C} AP_c, \qquad (5.7)$$

where C is the total number of classes. The benefit of this metric is that, even for an unbalanced dataset, all classes are weighted equally, which rewards good overall classification over merely classifying the most occurring classes correctly.
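A minimal sketch of the AP and mAP computations in Equations 5.6 and 5.7 is given below, assuming each class provides the classifier scores used to rank the test images together with a relevance indicator; names are illustrative.

import numpy as np

def average_precision(scores, relevant):
    # Rank the test images by descending score and apply Equation 5.6.
    order = np.argsort(-np.asarray(scores))
    rel = np.asarray(relevant, dtype=float)[order]
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return (precision_at_k * rel).sum() / rel.sum()

def mean_average_precision(per_class_scores, per_class_relevant):
    # Equation 5.7: the mean of the per-class AP values.
    aps = [average_precision(s, r) for s, r in zip(per_class_scores, per_class_relevant)]
    return float(np.mean(aps))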

5.2.2 Combining measures

Table 5.2 displays the results of the experiments described in Section 4.5. The results show that the combination of RGB, HHA, and the novel depth features achieves the highest mAP score of 70.1%, while the HHA feature achieves the lowest mAP score of 46.3%, although this is only marginally worse than the depth feature by itself, with a mAP score of 46.5%. Another interesting finding is that the combination of HHA and depth features performs better than the RGB feature. As both the HHA feature and the depth feature only use the depth channel of the RGB-D data, it can be derived that the depth channel has the potential to be sufficient by itself for scene classification.

Table 5.2: AP and mAP results of the scene classification task for the individual features (R)GB, (H)HA, and (D)EPTH and their combinations R+H, R+D, D+H, and R+H+D, per scene category (Bathroom, Bedroom, Classroom, Computer room, Conference room, Corridor, Dining area, Dining room, Discussion area, Home office, Kitchen, Lab, Lecture theatre, Library, Living room, Office, Office kitchen, Printer room, Recreation room, Rest space, and Study space), using three-fold cross-validation. Results are AP and mAP in percentages; the highest scores per category are bolded. Overall, the combination of RGB, HHA, and depth features shows the highest performance.

From Table 5.2 one can also see that, without any combination of features, the RGB CNN feature yields the best results. A possible explanation is that there are significantly more RGB datasets available compared to RGB-D datasets, allowing more training for the RGB CNN. If RGB-D data had been available in similar quantities, the HHA CNN could have received significantly more fine-tuning, or it could even have been trained from scratch. The new depth feature probably performs better than the HHA feature because it does not just use pixel information, but also uses object-object relations within the image, which provides more context. For instance, looking at the results in Table 5.2, we can see that a combination of just the RGB and depth features achieves higher scores for classes such as Bathroom, Bedroom, and Dining room when compared to combining all three features. For scene classes such as Bathroom and Bedroom the depth feature performs well, because objects that are likely to appear in those scenes, such as a bed, toilet, or shower, barely appear in any of the other classes. At the same time, beds and toilets are usually not perfectly straight planar surfaces, which influences the HHA feature and introduces more noise.

Figure 5.2: Samples of the lecture theatre class and the classroom class.

Although HHA usually performs worse than the novel depth feature, there are also instances where it performs better. One example is the Lecture theatre class. When inspecting the annotations for this class, we often see that long tables in the scene, as in Figure 5.2a, are annotated as a single long table. The HHA feature works well for planar objects, because the height above the ground is equal for the entire table, the direction of the normals is equal along the plane, and the angle of the object is the same. Such an image is harder to classify for the depth feature, because there is just a single table and some chairs in the scene. If we look at the confusion matrix of the R+H+D feature, which is represented as a heat map in Figure 5.3, we see that the Lecture theatre is labelled correctly eleven times, while it is incorrectly labelled as a Classroom 30 times. One could argue that Lecture

theatre is probably too similar to the more frequently occurring Classroom. Looking at samples of the two classes in Figure 5.2, we can see that there are only minor differences between them. The biggest difference is probably that the lecture theatre has elevated seats for students and either no tables in the image or very long tables, while the classroom is more evenly levelled and has separate tables per seat. The main difference between the classes is the context, where classrooms are found in high schools and lecture theatres in universities. The general layout of tables in rows, seats in rows, and either a blackboard or a whiteboard is shared between the classes.

Figure 5.3: A heat map of the confusion matrix of the results of the R+H+D scene category classification.

Table 5.2 also provides additional insights into the average performance per class. For instance, the bathroom, bedroom, and kitchen classes enjoy high AP scores across all the tested measures, while the recreation room class performs significantly worse than most of the other scene categories, with a mAP score of 12.5%. A possible explanation is that

the recreation room class is the least occurring class in the dataset, and that there is not enough training data for this class to be classified correctly. Another possible explanation is that the class is simply too similar to other classes. If we look at a small sample of the recreation room images in the dataset in Figure 5.4, we see that the recreation room class can hold a variety of different scenes. For instance, Figure 5.4c displays table tennis, which would be quite suitable for a recreation room, while Figure 5.4a could easily have been labelled as a rest space when compared to Figure 5.4f. Figure 5.4h only displays a whiteboard on a wall, which could belong to almost any school, university, or office related class.

Figure 5.4: Samples of the recreation room class and the rest space class.

In order to determine with more certainty whether the wide variety of scenes in the recreation room class is the cause of the high number of misclassifications, we look at the confusion matrix of the R+H+D approach in Figure 5.3. When we look at the distribution of the recreation room row, we see that it is most often confused with rest space. The rest space class contains scenes that are similar to those found in the recreation room class, as can be seen in Figure 5.4. Furthermore, the recreation room class does not occur often in the first place, and experiments with a higher number of examples of this scene class would be needed to provide a definite answer as to why this class is not being correctly classified with the current approach.

Finally, the effectiveness of the proposed depth feature is further explored by comparing it to a context feature based on just the object co-occurrences in a scene. The results indicate that the proposed depth feature with the current parameters performs worse than the co-occurrence context feature: the mAP score of the co-occurrence feature is 51.9% compared to 46.5% for the depth feature. When used in conjunction with the HHA and RGB features, the performance is increased from 70.1% to 71.9% when compared to the combination of the three original features. These results are unexpected, as one would expect the binned distance histograms to carry at least as much information as plain co-occurrence counts.


More information

Grouping and Segmentation

Grouping and Segmentation 03/17/15 Grouping and Segmentation Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem Today s class Segmentation and grouping Gestalt cues By clustering (mean-shift) By boundaries (watershed)

More information

Separating Objects and Clutter in Indoor Scenes

Separating Objects and Clutter in Indoor Scenes Separating Objects and Clutter in Indoor Scenes Salman H. Khan School of Computer Science & Software Engineering, The University of Western Australia Co-authors: Xuming He, Mohammed Bennamoun, Ferdous

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

CS4442/9542b Artificial Intelligence II prof. Olga Veksler

CS4442/9542b Artificial Intelligence II prof. Olga Veksler CS4442/9542b Artificial Intelligence II prof. Olga Veksler Lecture 8 Computer Vision Introduction, Filtering Some slides from: D. Jacobs, D. Lowe, S. Seitz, A.Efros, X. Li, R. Fergus, J. Hayes, S. Lazebnik,

More information

University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision

University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision report University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision Web Server master database User Interface Images + labels image feature algorithm Extract

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

Deep Convolutional Neural Networks. Nov. 20th, 2015 Bruce Draper

Deep Convolutional Neural Networks. Nov. 20th, 2015 Bruce Draper Deep Convolutional Neural Networks Nov. 20th, 2015 Bruce Draper Background: Fully-connected single layer neural networks Feed-forward classification Trained through back-propagation Example Computer Vision

More information

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech Convolutional Neural Networks Computer Vision Jia-Bin Huang, Virginia Tech Today s class Overview Convolutional Neural Network (CNN) Training CNN Understanding and Visualizing CNN Image Categorization:

More information

CS4442/9542b Artificial Intelligence II prof. Olga Veksler

CS4442/9542b Artificial Intelligence II prof. Olga Veksler CS4442/9542b Artificial Intelligence II prof. Olga Veksler Lecture 2 Computer Vision Introduction, Filtering Some slides from: D. Jacobs, D. Lowe, S. Seitz, A.Efros, X. Li, R. Fergus, J. Hayes, S. Lazebnik,

More information

BSB663 Image Processing Pinar Duygulu. Slides are adapted from Selim Aksoy

BSB663 Image Processing Pinar Duygulu. Slides are adapted from Selim Aksoy BSB663 Image Processing Pinar Duygulu Slides are adapted from Selim Aksoy Image matching Image matching is a fundamental aspect of many problems in computer vision. Object or scene recognition Solving

More information

Learning and Recognizing Visual Object Categories Without First Detecting Features

Learning and Recognizing Visual Object Categories Without First Detecting Features Learning and Recognizing Visual Object Categories Without First Detecting Features Daniel Huttenlocher 2007 Joint work with D. Crandall and P. Felzenszwalb Object Category Recognition Generic classes rather

More information

Fast Natural Feature Tracking for Mobile Augmented Reality Applications

Fast Natural Feature Tracking for Mobile Augmented Reality Applications Fast Natural Feature Tracking for Mobile Augmented Reality Applications Jong-Seung Park 1, Byeong-Jo Bae 2, and Ramesh Jain 3 1 Dept. of Computer Science & Eng., University of Incheon, Korea 2 Hyundai

More information

Recap from Monday. Visualizing Networks Caffe overview Slides are now online

Recap from Monday. Visualizing Networks Caffe overview Slides are now online Recap from Monday Visualizing Networks Caffe overview Slides are now online Today Edges and Regions, GPB Fast Edge Detection Using Structured Forests Zhihao Li Holistically-Nested Edge Detection Yuxin

More information

Generating Object Candidates from RGB-D Images and Point Clouds

Generating Object Candidates from RGB-D Images and Point Clouds Generating Object Candidates from RGB-D Images and Point Clouds Helge Wrede 11.05.2017 1 / 36 Outline Introduction Methods Overview The Data RGB-D Images Point Clouds Microsoft Kinect Generating Object

More information

STUDYING THE FEASIBILITY AND IMPORTANCE OF GRAPH-BASED IMAGE SEGMENTATION TECHNIQUES

STUDYING THE FEASIBILITY AND IMPORTANCE OF GRAPH-BASED IMAGE SEGMENTATION TECHNIQUES 25-29 JATIT. All rights reserved. STUDYING THE FEASIBILITY AND IMPORTANCE OF GRAPH-BASED IMAGE SEGMENTATION TECHNIQUES DR.S.V.KASMIR RAJA, 2 A.SHAIK ABDUL KHADIR, 3 DR.S.S.RIAZ AHAMED. Dean (Research),

More information

Detection III: Analyzing and Debugging Detection Methods

Detection III: Analyzing and Debugging Detection Methods CS 1699: Intro to Computer Vision Detection III: Analyzing and Debugging Detection Methods Prof. Adriana Kovashka University of Pittsburgh November 17, 2015 Today Review: Deformable part models How can

More information

Contexts and 3D Scenes

Contexts and 3D Scenes Contexts and 3D Scenes Computer Vision Jia-Bin Huang, Virginia Tech Many slides from D. Hoiem Administrative stuffs Final project presentation Dec 1 st 3:30 PM 4:45 PM Goodwin Hall Atrium Grading Three

More information

HISTOGRAMS OF ORIENTATIO N GRADIENTS

HISTOGRAMS OF ORIENTATIO N GRADIENTS HISTOGRAMS OF ORIENTATIO N GRADIENTS Histograms of Orientation Gradients Objective: object recognition Basic idea Local shape information often well described by the distribution of intensity gradients

More information

Object recognition (part 2)

Object recognition (part 2) Object recognition (part 2) CSE P 576 Larry Zitnick (larryz@microsoft.com) 1 2 3 Support Vector Machines Modified from the slides by Dr. Andrew W. Moore http://www.cs.cmu.edu/~awm/tutorials Linear Classifiers

More information

CS 231A Computer Vision (Winter 2018) Problem Set 3

CS 231A Computer Vision (Winter 2018) Problem Set 3 CS 231A Computer Vision (Winter 2018) Problem Set 3 Due: Feb 28, 2018 (11:59pm) 1 Space Carving (25 points) Dense 3D reconstruction is a difficult problem, as tackling it from the Structure from Motion

More information

Computer Vision for HCI. Topics of This Lecture

Computer Vision for HCI. Topics of This Lecture Computer Vision for HCI Interest Points Topics of This Lecture Local Invariant Features Motivation Requirements, Invariances Keypoint Localization Features from Accelerated Segment Test (FAST) Harris Shi-Tomasi

More information

Two-Stream Convolutional Networks for Action Recognition in Videos

Two-Stream Convolutional Networks for Action Recognition in Videos Two-Stream Convolutional Networks for Action Recognition in Videos Karen Simonyan Andrew Zisserman Cemil Zalluhoğlu Introduction Aim Extend deep Convolution Networks to action recognition in video. Motivation

More information

Part Localization by Exploiting Deep Convolutional Networks

Part Localization by Exploiting Deep Convolutional Networks Part Localization by Exploiting Deep Convolutional Networks Marcel Simon, Erik Rodner, and Joachim Denzler Computer Vision Group, Friedrich Schiller University of Jena, Germany www.inf-cv.uni-jena.de Abstract.

More information

Convolutional Networks in Scene Labelling

Convolutional Networks in Scene Labelling Convolutional Networks in Scene Labelling Ashwin Paranjape Stanford ashwinpp@stanford.edu Ayesha Mudassir Stanford aysh@stanford.edu Abstract This project tries to address a well known problem of multi-class

More information

Category vs. instance recognition

Category vs. instance recognition Category vs. instance recognition Category: Find all the people Find all the buildings Often within a single image Often sliding window Instance: Is this face James? Find this specific famous building

More information

Applications. Foreground / background segmentation Finding skin-colored regions. Finding the moving objects. Intelligent scissors

Applications. Foreground / background segmentation Finding skin-colored regions. Finding the moving objects. Intelligent scissors Segmentation I Goal Separate image into coherent regions Berkeley segmentation database: http://www.eecs.berkeley.edu/research/projects/cs/vision/grouping/segbench/ Slide by L. Lazebnik Applications Intelligent

More information

Motion Tracking and Event Understanding in Video Sequences

Motion Tracking and Event Understanding in Video Sequences Motion Tracking and Event Understanding in Video Sequences Isaac Cohen Elaine Kang, Jinman Kang Institute for Robotics and Intelligent Systems University of Southern California Los Angeles, CA Objectives!

More information