
CSC494: Individual Project in Computer Science

Seyed Kamyar Seyed Ghasemipour
Department of Computer Science, University of Toronto

Abstract

One of the most important tasks in computer vision is object recognition. In this work we explore to what extent local descriptors computed at keypoints can be useful for this task. We experiment with different combinations of keypoint detectors and local feature descriptors on a dataset of point clouds of rendered CAD models as well as a dataset of point clouds of individual object instances segmented out from RGB-D images of natural scenes. Our results demonstrate that by jointly taking into account the features computed from an object, a simple nearest neighbour classification framework can achieve interesting classification performance.

1. Introduction

One of the most important components of an object recognition pipeline is the representation of the objects that is used. The most successful methods in this area iteratively build more complex, higher order representations by combining lower order ones to create more semantic representations. In this report, on the other hand, we explore the efficacy of using purely local feature descriptors, extracted at specific interest points, for the task of object recognition.

In the task of recognizing a specific object instance across multiple scenes, in the presence of clutter, and with variations in viewpoint, a standard pipeline is to employ a keypoint detection algorithm to find salient, repeatable interest points on the object, and then to encode, in a sufficiently unique manner, the detected interest points and their surrounding neighbourhoods using a feature descriptor. At test time, given a novel scene, the features of the objects of interest are compared against features extracted from the scene in a similar manner. Given sufficient consensus between the relative locations of the matched features in the scene and those on a previously seen object, the object is deemed to have been discovered in the scene.

In this work, we explore to what extent this pipeline can be used for object recognition. We experiment with combinations of two keypoint detectors and two feature descriptors on two different datasets: a dataset of point clouds generated from CAD models of 10 common household objects, and a dataset of point clouds of the same object classes extracted from natural indoor scenes.

2. Datasets

As mentioned above, we experiment with artificial as well as natural data. This allows us to more effectively analyze the ability of our approach: the artificial set provides an easy, noiseless benchmark for testing our method, and the natural, noisy set gauges transferability to real-world scenarios. Both datasets consist of point clouds of individual object instances from 10 classes of household objects. The objects considered in our experiments are: bathtub, bed, chair, desk, dresser, monitor, night stand, sofa, table, and toilet. We proceed to briefly describe the method of generation of each dataset.

2.1. Artificial Data

To generate our artificial data, we made use of a collection of CAD models (of objects from the aforementioned classes) gathered by Wu et al. [7]. To simulate having a 2.5D point cloud, given each CAD model, a uniform grid of 12 points was placed on a sphere centered at the given object [8]. Subsequently, the object was rendered from the viewpoint of cameras placed on the grid points, looking towards the center of the sphere.
The depth buffer of each of the renderings was then converted to a point cloud to generate our artificial dataset. In our experiments with the artificial data, we test our object recognition method in two scenarios: recognizing previously seen objects from novel viewpoints, and recognizing novel object instances. For the purposes of our experiments, we created three data splits:

Train Set: To create the training split, for each object class 20 instances were chosen, and for each instance, 4 of the 12 views were picked at random.

New Views Test Set: This set was created by randomly choosing 2 other views for the instances in the training set.

Novel Objects Test Set: The test set of novel object instances was created by randomly choosing 4 of 12 views for 5 different instances of each object category.

2.2. Natural Data

In our experiments, we also made use of the NYU Depth Dataset V2 [2]. This dataset consists of RGB-D images of various indoor scenes, with instance-level segmentations of the objects in the images. Using these segmentation annotations, we separated out instances of the objects of interest from the images and converted them into point clouds. The train and test sets for experiments with natural data contained 12 and 3 instances per object category respectively.

An important issue in dealing with point clouds of objects extracted from natural cluttered scenes is that they can be so highly occluded that they contain no distinguishable characteristics of the object categories they belong to. Although occlusion is an important challenge in computer vision that needs to be dealt with, our framework processes object instances in isolation and, as a result, is not able to cope with high degrees of occlusion (more on the importance of context in Section 8). Therefore, the point clouds used in our experiments were chosen by manually sifting through the extracted data and keeping those that bore resemblance to the object categories they were meant to represent. Figure 1 shows samples of object instances used in our experiments as well as examples of point clouds that were dismissed for not being representative of their classes.

Figure 1: Samples from chair, table, and desk categories (Green: used in dataset, Red: dismissed).

2.3. Data Preprocessing

Either due to measurement errors or simply due to the angle of a surface with respect to the camera, point clouds can contain stray outlier points. To remove these points from our point clouds, the mean and the standard deviation of the average distance of points to their 50 nearest neighbours were computed for each cloud. Points whose computed values were outside one standard deviation from the mean were marked as outliers and removed from the point clouds. Figure 2 shows an example point cloud before and after the removal of outliers.

Figure 2: An example point cloud before and after outlier removal.
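As a concrete illustration, the following is a minimal sketch of this preprocessing step, assuming numpy and scipy are available (the function name remove_outliers is ours; the k = 50 neighbourhood matches the description above):

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_outliers(points, k=50):
    """Drop points whose average distance to their k nearest neighbours
    lies more than one standard deviation from the mean of that
    statistic over the whole cloud."""
    tree = cKDTree(points)
    # k + 1 neighbours because each point is its own nearest neighbour.
    dists, _ = tree.query(points, k=k + 1)
    avg = dists[:, 1:].mean(axis=1)      # per-point average distance
    mu, sigma = avg.mean(), avg.std()
    keep = np.abs(avg - mu) <= sigma     # within one std of the mean
    return points[keep]

# Example usage on a random stand-in for a real point cloud:
cloud = np.random.rand(2000, 3)
clean = remove_outliers(cloud)
```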
3. Keypoint Detectors

Keypoints in images or point clouds are points which are deemed more interesting than other points relative to a set of criteria. In the task of detecting the same object across different scenes, the most desirable property of keypoints is their repeatability; this ensures that the same keypoints can be detected and potentially matched across different viewpoints and scenes. Towards the goal of repeatability, most keypoint detection algorithms analyze the distribution of geometric attributes in a local neighbourhood and choose points where these distributions have high variance. This allows these algorithms to find regions with non-generic and interesting geometric structures.

Under the premise that local surface features are informative for general object recognition, keypoints also play an important role from a computational point of view. Computing local surface features at every point in a point cloud is an expensive process. Furthermore, at test time, given a new object, analyzing and comparing its features to those that have been previously seen can become an arduous task if non-parametric methods (such as k-nearest-neighbours) are used. In the following sections, we present the keypoint detection algorithms that we compared in our experiments.

3.1. Intrinsic Shape Signatures Keypoint Detector

In [9], Zhong presents a 3D shape descriptor called the Intrinsic Shape Signature (ISS). One of the keypoint detection methods that we consider is the one used in the interest point extraction step of ISS.

3.1.1. Method

The salience of a given point is determined by analyzing the eigenvalues of the scatter matrix of the points in its neighbourhood. More formally, let r_{density} denote a radius used for estimating the density of points around a point of interest, p_i, and let r_{nbhd} denote a radius used for determining its salience. Additionally, let w_i = |\{p_j : \|p_j - p_i\| < r_{density}\}| represent a measure of density. The procedure for determining whether p_i is a keypoint is as follows:

1. The weighted scatter matrix about p_i is computed as:

   SC(p_i) = \frac{\sum_{\|p_j - p_i\| < r_{nbhd}} w_j (p_j - p_i)(p_j - p_i)^T}{\sum_{\|p_j - p_i\| < r_{nbhd}} w_j}

2. The eigenvalues of SC(p_i) are computed next. Let \lambda^1_i, \lambda^2_i, \lambda^3_i represent the eigenvalues in order of decreasing magnitude.

3. Points where \lambda^2_i < \gamma_{21} \lambda^1_i and \lambda^3_i < \gamma_{32} \lambda^2_i are chosen as the potential set of keypoints (\gamma_{21} and \gamma_{32} are parameters that can be tuned).

4. The final set of keypoints is determined by non-maximal suppression using the magnitude of \lambda^3_i.

3.1.2. Intuition

The eigenvalues of SC(p_i) represent to what extent the positions of points in the local neighbourhood of p_i vary along three orthogonal axes. Therefore the behaviour of the keypoint detection method outlined above depends heavily on the values of \gamma_{21} and \gamma_{32}. If \gamma_{21} and \gamma_{32} are both large, then all points pass through step 3, and the resulting set of keypoints will contain points whose neighbours are scattered in every direction. However, on flat surfaces, such as the bottom of a chair, the magnitude of the third eigenvalue is very small in comparison to the first two. In such regions, as a result of the non-maximal suppression step, the algorithm will behave similarly to uniform sampling of points.

Setting \gamma_{21} and \gamma_{32} too small also has side-effects. If \gamma_{32} is small, the keypoints will capture flat surfaces, which is not a desirable property. Making \gamma_{21} small, however, will help to capture edges, since at a point on a 3-dimensional edge there will typically be two directions with neighbours on only one side, and a third direction with neighbours on both sides of the point.

Lastly, we observe that it would be difficult for this method to detect corners. At corners, all three eigenvalues would have similar magnitudes. Therefore, unless \gamma_{21} and \gamma_{32} are both set to be large, corners will not be detected as keypoints. Indeed, the first image in Figure 3 shows that with \gamma_{21} and \gamma_{32} set to 0.7 and 0.5 respectively, the algorithm did not choose the corners of the table as keypoints.

3.2. Harris 3D Corner Detector

The Harris corner detector is a popular method of extracting interest points from 2D images. One of the keypoint detectors that we use in our experiments is the extension of the Harris corner detector to 3D point clouds.

3.2.1. Method

Let I^{(j)}_x and I^{(j)}_y denote the gradients in the x and y directions at the point p_j in a 2D image. In the original Harris corner detection algorithm, the magnitudes of the eigenvalues of:

   M(p_i) = \sum_{\|p_j - p_i\| < r_{nbhd}} [I^{(j)}_x, I^{(j)}_y]^T [I^{(j)}_x, I^{(j)}_y]

are considered to be indicative of the existence of an edge or corner at p_i. The extension of this idea to the case of 3D point clouds, as implemented in the Point Cloud Library (PCL) [4], is to replace the matrix M with the covariance of the normals at the points in the neighbourhood of p_i:

   COV(p_i) = \sum_{\|p_j - p_i\| < r_{nbhd}} N_j N_j^T

3.2.2. Intuition

The method described in 3.1 cares about the spatial distribution of points in a local neighbourhood. However, this information does not say much about the curvature of the local region. The Harris 3D corner detector, on the other hand, does consider this information by caring about the extent to which the directions of the surface normals vary along different orthogonal directions. A comparison of keypoints detected with the two methods discussed thus far can be seen in Figures 3 and 4.
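The eigenvalue-ratio tests of 3.1.1 can be sketched as follows (non-maximal suppression is omitted; the function name and the use of scipy's cKDTree for neighbourhood queries are our own choices):

```python
import numpy as np
from scipy.spatial import cKDTree

def iss_ratio_tests(points, r_density, r_nbhd, gamma_21, gamma_32):
    """For every point, evaluate the two eigenvalue-ratio tests of
    step 3 and return the lambda_3 values used for suppression."""
    tree = cKDTree(points)
    # w_i: number of neighbours within r_density (the density measure).
    w = np.array([len(tree.query_ball_point(p, r_density)) for p in points],
                 dtype=float)
    passes = np.zeros(len(points), dtype=bool)
    lam3 = np.zeros(len(points))
    for i, p in enumerate(points):
        nbrs = tree.query_ball_point(p, r_nbhd)
        diff = points[nbrs] - p
        # Step 1: weighted scatter matrix SC(p_i).
        sc = (w[nbrs, None, None] * diff[:, :, None] * diff[:, None, :]).sum(0)
        sc /= w[nbrs].sum()
        # Step 2: eigenvalues in decreasing order of magnitude.
        lam = np.sort(np.linalg.eigvalsh(sc))[::-1]
        # Step 3: the two ratio tests.
        passes[i] = (lam[1] < gamma_21 * lam[0]) and (lam[2] < gamma_32 * lam[1])
        lam3[i] = lam[2]
    return passes, lam3
```

Step 4 would then keep only the passing points whose lambda_3 is a local maximum within the suppression radius.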

Figure 3: Keypoints detected on a table (left: ISS, right: Harris).

Figure 4: Keypoints detected on a chair using different methods (in order from left to right: ISS, Harris, Uniform).

3.3. Uniform Sampling

As mentioned before, keypoint detection methods tend to favour non-generic regions with interesting structures. However, for building representations of objects, high variance regions are not the only informative regions. The existence and the distribution of smooth and flat regions could also be valuable information. Hence, as a baseline to the two keypoint detection techniques mentioned above, we also experiment with uniformly sampling points from our point clouds as a replacement for interest point detection algorithms.

3.4. Practical Notes & Implementation Details

Two important parameters that need to be set for the computation of ISS keypoints are r_{nbhd} and the radius of non-maximal suppression. Data from our point clouds can contain a significant amount of noise (this is especially true for natural data). Therefore, we would not want these radii to be too small. After some parameter tuning, we decided to set r_{nbhd} and the radius of non-maximal suppression to be 12 and 8 times the model resolution respectively, where the model resolution of a point cloud is computed as the average distance of a point to its nearest neighbour in the cloud (see the sketch below).

For the Harris keypoint detector, we set the radius used for performing the computations to be 8 times the model resolution. Additionally, this detector requires surface normals to be computed. If the support radius used to compute the normals is too small, the computations will be susceptible to noise. On the other hand, if this radius is too large, the normals inside a local region will be very similar to one another, thereby negatively affecting the keypoint detection process. Eventually, we decided to set this parameter to be 3 times the model resolution. For uniform sampling, we randomly sampled points from the point cloud of each object.
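Since every radius above is expressed in multiples of the model resolution, the following sketch shows how that quantity can be computed (again using scipy's cKDTree; the multipliers mirror the ones reported in this section):

```python
import numpy as np
from scipy.spatial import cKDTree

def model_resolution(points):
    """Average distance from each point to its nearest neighbour."""
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=2)   # column 0 is the point itself
    return dists[:, 1].mean()

res = model_resolution(np.random.rand(1000, 3))   # stand-in cloud
r_nbhd = 12 * res          # ISS salience / descriptor support radius
r_suppression = 8 * res    # ISS non-maximal suppression / Harris radius
r_normals = 3 * res        # support radius for normal estimation
```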

3.5. Statistics

Table 1 presents the mean ratio of the number of keypoints detected to the number of points in the point clouds for objects in the training sets of each dataset. The larger ratios for the natural dataset are a result of the noisier nature of the data.

Table 1: Ratio of keypoints to number of points in the point cloud (rows: Natural Data, Artificial Data; columns: ISS, Harris 3D, Uniform).

4. Local Descriptors

After performing keypoint extraction on the data of interest, the next step in our experimental pipeline is to compute local descriptors at the salient points. The role of a descriptor computed at a given point is to represent the properties of the local surface around the point in a compact, yet sufficiently unique manner. Below, we discuss the two local descriptors that we used in our experiments.

4.1. Fast Point Feature Histograms

In [5], Rusu et al. present the Point Feature Histogram (PFH) as a means of capturing the geometrical properties of the neighbourhood of a point. Later work [3] presents a modification of PFH named the Fast Point Feature Histogram (FPFH), which significantly reduces the computational cost associated with the feature computation. We proceed by first describing how PFH descriptors are computed and then explain how they have been modified to produce the FPFH.

4.1.1. Point Feature Histograms

Point Feature Histograms attempt to capture the geometry of a local region by taking into account the relative directions of the normals in that region. Given a point of interest, p_q, the PFH for that point is computed as follows:

1. Let S(p_q) = \{p_t : \|p_q - p_t\| < r_{nbhd}\}, where r_{nbhd} is a hyperparameter of the feature computation.

2. For a given pair of points (p_s, p_t) in S(p_q), with normals (n_s, n_t), the point whose normal makes the smallest angle with p_t - p_s is chosen as the source.

3. Without loss of generality, we assume the source to be p_s. A frame about the point p_s is created using the orthonormal basis (u, v, w) defined as follows:

   u = n_s
   v = u \times \frac{p_t - p_s}{\|p_t - p_s\|}
   w = u \times v

4. Given this frame, the quadruplet (\alpha, \phi, \theta, d) for the pair (p_s, p_t) is formed, where:

   d = \|p_t - p_s\|
   \alpha = v \cdot n_t
   \phi = u \cdot \frac{p_t - p_s}{d}
   \theta = \arctan(w \cdot n_t, u \cdot n_t)

5. For 2.5D images, however, the distance between neighbouring points differs across viewpoints. Therefore, it is common to eliminate the d element from the quadruplet.

Figure 5: Visualization of local frame and angles used for the computation of Point Feature Histograms [1].

To create the PFH descriptor for the point p_q, the triplets (\alpha, \phi, \theta) are computed for every pair of points in S(p_q). \alpha, \phi, and \theta are each binned using 5 bins, creating a total of 5^3 bins for the triplets. The Point Feature Histogram for p_q is then taken to be the 125-dimensional histogram of the computed triplets.
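To make steps 2-5 concrete, here is a sketch computing the (\alpha, \phi, \theta) triplet for a single pair of oriented points (the function name is ours and the frame follows the definitions above; the d element is dropped as in step 5):

```python
import numpy as np

def pfh_triplet(p_s, n_s, p_t, n_t):
    """Angular features of 4.1.1 for one pair of points with unit normals."""
    diff = p_t - p_s
    d = np.linalg.norm(diff)
    # Step 2: the point whose normal makes the smaller angle with the
    # connecting line is the source; swap the roles if needed.
    if abs(np.dot(n_t, diff)) > abs(np.dot(n_s, diff)):
        p_s, p_t, n_s, n_t = p_t, p_s, n_t, n_s
        diff = -diff
    # Step 3: orthonormal frame (u, v, w) at the source point
    # (assumes n_s is not parallel to the connecting line).
    u = n_s
    v = np.cross(u, diff / d)
    v /= np.linalg.norm(v)
    w = np.cross(u, v)
    # Step 4: the angular features (step 5 drops d).
    alpha = np.dot(v, n_t)
    phi = np.dot(u, diff) / d
    theta = np.arctan2(np.dot(w, n_t), np.dot(u, n_t))
    return alpha, phi, theta
```

Binning each of the three values into 5 bins over all pairs in S(p_q) then yields the 125-dimensional PFH.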

4.1.2. Fast Point Feature Histograms

Computing Point Feature Histograms incurs a high computational cost due to the fact that the mentioned triplets are computed for every pair of points in the neighbourhood of the point under consideration. The Fast Point Feature Histogram [3] attempts to remedy this issue as follows:

1. In the first step, Simplified Point Feature Histograms (SPFH) are computed at every point in the cloud. For a given point p_q, the SPFH is computed in a very similar manner to PFH, with 2 main differences:

   - The triplets (\alpha, \phi, \theta) are only computed between p_q and its neighbours.
   - Instead of jointly binning the values of the triplets, a histogram of 11 bins is made for each of \alpha, \phi, and \theta separately, and the resulting histograms are concatenated to form a 33-dimensional feature vector.

2. In the second step, the FPFH feature for the point p_q is computed as:

   FPFH(p_q) = SPFH(p_q) + \frac{1}{|S(p_q)|} \sum_{p_k \in S(p_q)} \frac{1}{w_k} SPFH(p_k)

   where w_k is the distance of p_q to p_k.

4.1.3. Intuitions

FPFH attempts to capture the geometry of a local region by measuring how the directions of the normals in the region change relative to one another. However, without information about the relative locations of the normals, the same descriptor could potentially represent many types of surface geometries. This, in addition to the fact that \alpha, \phi, and \theta are binned independently, raises a concern about how distinctive FPFH features are in terms of representing surfaces with different structures. On a positive note, however, the fact that FPFH features only consider the relative directions of normals means that they are pose invariant and should in theory produce the same histograms when objects are seen from different viewpoints.

4.2. SHOT Descriptor

In [6], Tombari et al. present the Signature of Histograms of Orientations (SHOT). This is the second descriptor that we employed in our experiments.

4.2.1. Method

To compute the descriptor at a point p_i, the SHOT descriptor first necessitates the computation of a local reference frame. This reference frame is computed as follows:

1. Let M be:

   M = \frac{\sum_{j : d_j \le r_{nbhd}} (r_{nbhd} - d_j)(p_j - p_i)(p_j - p_i)^T}{\sum_{j : d_j \le r_{nbhd}} (r_{nbhd} - d_j)}

   where d_j = \|p_j - p_i\| and r_{nbhd} is a hand-tuned parameter.

2. The directions of the eigenvectors, sorted in decreasing order of eigenvalue magnitude, are taken in order to be the directions of the x, y, and z axes of the local reference frame. We denote these eigenvectors by x, y, z.

3. Let S^+_x = \{j : d_j \le r_{nbhd} \wedge (p_j - p_i) \cdot x \ge 0\} and S^-_x = \{j : d_j \le r_{nbhd} \wedge (p_j - p_i) \cdot (-x) > 0\}. The positive direction of the x axis for the reference frame is set to be the direction of x if |S^+_x| > |S^-_x|, and -x otherwise. This essentially means that the direction that contains the larger number of points is considered to be the positive direction.

4. The positive directions of the other axes are determined in a similar fashion.

Given the computed local reference frame at p_i, the SHOT descriptor is computed as follows:

1. A spherical grid similar to the one shown in Figure 6 is placed centered at p_i. The spherical grid has 8 divisions along the azimuth, 2 divisions along the elevation, and 2 divisions of the distance of a point to the center of the sphere.

2. For each division in the grid independently, a histogram is created by binning the values of cos(\theta_j), where \theta_j is the angle between the surface normal at a point p_j inside the division and the surface normal at p_i. 11 bins are used for this computation.

3. The computed histograms from the divisions are concatenated together and the resulting vector is normalized so that its components sum to one.

Figure 6: Spherical grid used in the computation of the SHOT descriptor.

4.2.2. Intuitions

The method outlined above attempts to create a more fine-grained descriptor. There is, however, a significant amount of ambiguity that results from solely binning the values of cos(\theta_j). Although \theta_j tells us to what extent a normal deviates from the normal at p_i, it does not tell us in which direction it deviates. This is a significant amount of ambiguity, and many different types of surfaces could potentially produce the same descriptor.
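The reference-frame construction of 4.2.1 reduces to a weighted eigen-decomposition plus a sign check. A sketch under those definitions follows (the function name is ours, and the fix-up of step 4 is simplified here to the same per-axis majority vote as step 3):

```python
import numpy as np

def shot_reference_frame(p_i, cloud, r_nbhd):
    """Local reference frame of 4.2.1 at p_i from its neighbourhood."""
    diff = cloud - p_i
    d = np.linalg.norm(diff, axis=1)
    mask = d <= r_nbhd
    diff, wgt = diff[mask], r_nbhd - d[mask]
    # Step 1: distance-weighted covariance M.
    M = (wgt[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(0)
    M /= wgt.sum()
    # Step 2: eigenvectors sorted by decreasing eigenvalue.
    vals, vecs = np.linalg.eigh(M)
    axes = vecs[:, np.argsort(vals)[::-1]].T   # rows: x, y, z
    # Steps 3-4: point each axis towards the side with more neighbours.
    for k in range(3):
        if (diff @ axes[k] >= 0).sum() < (diff @ axes[k] < 0).sum():
            axes[k] = -axes[k]
    return axes
```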

4.3. Practical Notes & Implementation Details

To use the local descriptors for the purpose of object recognition, we decided to choose a relatively large value for r_{nbhd}, so as to capture the geometry of a larger region and to deal with the presence of noise. We set this value to 12 times the resolution of the point clouds. As in the keypoint detection stage, the support radius for normal computation was set to 3 times the resolution of the point clouds.

There also exists a practical issue when working with SHOT descriptors: they are very high dimensional (352 dimensions for SHOT vs. 33 for FPFH). This creates a computational problem for performing nearest neighbour queries. To resolve this issue, using PCA, 30 highly informative orthogonal axes of the SHOT descriptors were identified from the training set. Subsequently, all SHOT descriptors from the training and test sets were preprocessed by being projected onto the derived axes (this was done for each dataset independently).

5. A Priori Expectations

A priori, we do not expect the method we employ in this work to produce amazing results. To begin with, the features that we extract are local surface descriptors. It is quite possible to significantly modify the local structure of points in a point cloud (for example, by adding sufficient amounts of noise) while still preserving a global structure that allows for the recognition of the object by a human, but not by our method. Additionally, in our framework, we do not take into account the relative positions of the keypoints at which we extract features. Even by encoding this information, we would still be required to deal with the ambiguities associated with working with rotation invariant feature descriptors.

However, if we observe positive results in our experiments, this will indicate that the local features we extract are able to represent non-trivial aspects of the objects they are derived from, providing motivation for future work to attempt to incorporate this information with more global properties in order to build better representations of objects.

6. Experiments

In this section we discuss the experiments we carried out using our pipeline. As mentioned in Section 2, we worked with both an artificial and a natural dataset. For the artificial dataset, we experimented with recognizing previously seen objects from new viewpoints in addition to recognizing novel objects. For the natural dataset, however, we only experiment with recognizing previously unseen objects from the given object categories.

6.1. Distinctiveness of Individual Keypoints

As a first experiment, it is interesting to explore to what extent individual features extracted from objects are indicative of the class of the object. To this effect, we performed two tests.

6.1.1. k-NN

If individual features extracted at keypoints are representative of the object category, nearest neighbours classification of the features should produce results better than chance. For a given value of k, nearest neighbours classification was done by first determining the k nearest neighbours to a query point, q, and then using a voting mechanism in which the neighbours vote for their object class with a weight inversely proportional to their distance from q. The plots in Figure 9 show the classification accuracy of k-NN for k in {1, 3, 5, 7, 9} on both datasets for every combination of keypoint extractor and feature descriptor.
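A sketch of this distance-weighted voting rule (the names are ours; a small epsilon guards against zero distances when a training descriptor is queried against itself):

```python
import numpy as np
from collections import defaultdict

def knn_classify(query, train_feats, train_labels, k=5, eps=1e-9):
    """Distance-weighted k-NN vote over individual descriptors (6.1.1)."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[train_labels[i]] += 1.0 / (dists[i] + eps)  # inverse-distance weight
    return max(votes, key=votes.get)
```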
First, the fact that we obtain results consistently better than chance indicates that there may indeed exist information that can be leveraged from the local descriptors. Interestingly, the plot for the new views test set of the artificial data seems to produce a strict ordering of the quality of the keypoint/descriptor combinations. Since good performance on this test set requires repeatability of the keypoints and distinctiveness of the descriptors, the results indicate that ISS keypoints are more repeatable than Harris corners, and that FPFH features better describe the local surfaces of the objects. On the novel object test sets of both datasets, however, uniform sampling achieves performance close to that of the keypoint-based methods, and the benefit of having keypoints diminishes.

Lastly, we note that FPFH based methods do not achieve 100% accuracy for 1-NN on the training sets. This is due to the fact that FPFH features for various points end up having the exact same description. This is related to our previous concern regarding the distinctiveness of FPFH, but we do not know how to reconcile it with their superior performance discussed in the previous paragraph.

6.1.2. k-means

If the extracted features are indicative of class, then after performing k-means clustering on the descriptors, the distribution of class labels in each cluster should be non-uniform and heavily skewed. To perform this analysis, for each dataset, the training set was clustered using k-means with 50 clusters. Subsequently, the features from the test sets were assigned to the cluster with the closest center. Figures 16 through 44 demonstrate the distribution of class labels inside the clusters for the various combinations of keypoints and descriptors. In the visualizations, columns indicate class and rows indicate clusters; the brightness of a pixel indicates the proportion of data in that cluster belonging to the particular class.
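The matrices visualized in those figures can be computed as in the following sketch, assuming cluster centers from any k-means implementation (50 clusters in our case) and integer class labels:

```python
import numpy as np

def cluster_label_distribution(features, labels, centers, n_classes):
    """Rows: clusters; columns: classes; entries: label proportions."""
    # Assign each feature to its closest cluster center.
    assign = np.argmin(((features[:, None, :] - centers[None]) ** 2).sum(-1),
                       axis=1)
    counts = np.zeros((len(centers), n_classes))
    for a, y in zip(assign, labels):
        counts[a, y] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(rows, 1)   # avoid dividing empty clusters by zero
```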

Figure 9: k-NN accuracy for classification of individual features (panels: artificial data, natural data).

The results show that for the artificial dataset, some structure is retained in the distribution patterns across the train, new views test, and novel instances test sets. Also, the images for SHOT tend to be brighter than those for FPFH, which indicates that the clusters from SHOT descriptors have a more uniform distribution. For the natural dataset, however, the distributions of class labels inside the clusters do not stay the same between the train and test sets. This is not unexpected, since in natural data, due to significant amounts of noise, the repeatability of features is quite hindered.

6.2. Vote Aggregation

Individual features extracted from point clouds are susceptible to noise, and as can be seen in the results from 6.1.1, classification using these features independently, although above chance, does not produce very good results.

Figure 16: Distribution of labels in each of 50 clusters for the TRAIN set of the ARTIFICIAL DATA (panels, left to right: FPFH + ISS, FPFH + Harris, FPFH + Uniform, SHOT + ISS, SHOT + Harris, SHOT + Uniform).

Figure 23: Distribution of labels in each of 50 clusters for the NEW VIEWS TEST set of the ARTIFICIAL DATA (same panel layout).

Figure 30: Distribution of labels in each of 50 clusters for the NOVEL INSTANCES TEST set of the ARTIFICIAL DATA (same panel layout).

In this experiment, instead, the individual features of an object each voted for a class determined by the k-NN classification decision. The majority vote of the predicted classes was taken as the prediction for the full object. Figure 47 presents the results obtained from this experiment.

At first glance, classification performance for the artificial dataset is improved very significantly, whereas for natural data, the results are a mix of performance gains and losses for the different types of keypoint-descriptor pairs. Looking at the results obtained from the various test sets, what is quite odd is that the relative ordering of how well the different keypoint-descriptor combinations perform is not preserved; for the artificial dataset, the SHOT descriptor actually does a better job than FPFH, whereas the reverse was the case when classifying individual features. We do not have a justification for why this may have occurred. One aspect that is preserved between the results from 6.1.1 and here is that keypoint based feature computation results in noticeably better accuracies only for the new views test set of the artificial dataset, and the performance gap shrinks when novel object instances are considered.
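For reference, the vote aggregation described above amounts to the following sketch, reusing the knn_classify helper from 6.1.1:

```python
from collections import Counter

def classify_object(object_feats, train_feats, train_labels, k=5):
    """Majority vote over per-feature k-NN predictions (6.2)."""
    preds = [knn_classify(f, train_feats, train_labels, k) for f in object_feats]
    return Counter(preds).most_common(1)[0][0]
```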

Figure 37: Distribution of labels in each of 50 clusters for the TRAIN set of the NATURAL DATA (same panel layout).

Figure 44: Distribution of labels in each of 50 clusters for the TEST set of the NATURAL DATA (same panel layout).

6.3. Histogram of Histograms

The last experiment that we performed was to use a global representation of the objects derived from the local feature descriptors. The representation that we used was computed as follows (a code sketch appears at the end of this subsection):

1. Features from the training set were clustered using k-means with 50 clusters.

2. For objects in both the train and test sets, a histogram with 50 bins was created, where each bin counts the number of features from the given object that belong to a given cluster.

3. The histograms, normalized so that the values in the bins sum to one, were subsequently used as the representations of the objects.

Classification using these representations was done in a similar fashion to experiment 6.1.1, with the difference that here the data points are the computed histograms. The plots in Figure 50 present the classification results. This set of plots shares similar properties with the two previous ones from experiments 6.1.1 and 6.2 (such as the keypoints being more relevant to the new views test set than to the novel objects test set), and the performance on the artificial dataset seems to lie in between those of experiments 6.1.1 and 6.2. However, the most interesting result from this experiment is that we were able to significantly improve classification accuracy on the natural data test set, which we were not able to do in experiment 6.2. This is quite surprising to us, since the results from Section 6.1.2 hinted that the distribution of prototypical features changes significantly between the natural data train and test sets (further discussion in the conclusions section).
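A sketch of steps 1-3 for a single object, given cluster centers obtained from the training features (the function name is ours):

```python
import numpy as np

def object_histogram(object_feats, centers):
    """Normalized histogram of prototype assignments for one object (6.3)."""
    assign = np.argmin(((object_feats[:, None, :] - centers[None]) ** 2).sum(-1),
                       axis=1)
    hist = np.bincount(assign, minlength=len(centers)).astype(float)
    return hist / hist.sum()   # bins sum to one
```

Classification then applies the weighted k-NN of 6.1.1 to these 50-dimensional histograms.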

Figure 47: k-NN accuracy for instance-level classification (panels: artificial data, natural data).

7. Conclusions

In this work, we explored to what extent local descriptors computed at keypoints can be useful for object recognition. We experimented with FPFH and SHOT descriptors computed at keypoints obtained using 3 different methods: ISS keypoints, Harris 3D corners, and uniform sampling. To adapt this framework from instance detection to object recognition, we used larger radii of salience to capture information that is less local.

Our experiments show that individual features on their own do not contain enough information for good classification. However, when the predictions for the individual features from an object are combined to make a judgement about the object class, there is a very significant improvement in performance for the artificial dataset. This, however, does not improve classification accuracy for natural data significantly. The reason could be that, due to the significant amount of noise in natural data, keypoint detection does not perform well and the descriptors do not represent the true surface shape. Additionally, we note that the similarity of accuracies on the novel objects artificial test set with or without the use of keypoints (keypoints vs. uniform sampling) could indicate that smooth regions are also informative of the class of objects.

Lastly, an interesting result that we observed was that classification results for natural data are significantly improved in the experiment of Section 6.3. The results on artificial data indicated that, when considered jointly, the computed features can be useful for recognizing the class of an object. Representing objects using histograms of prototypical features can be considered as doing the same, with the added benefit of robustness to noise due to substituting features with their prototypes (cluster centers). Since natural data are noisy, the results on those test sets were improved, whereas the results for artificial data were comparable to those obtained in experiment 6.2.

Figure 50: k-NN accuracy for classification using normalized histograms of prototypical features (panels: artificial data, natural data).

8. Future Directions

The work in this report is applicable to a constrained situation. First, we required that object instances be pre-segmented. Segmenting out individual objects from cluttered scenes is a very difficult task, and if we do not perform the segmentation, our feature computations will be inaccurate. Second, we only experimented with non-occluded (artificial data) or not-heavily-occluded (natural data) objects, although self-occlusion was present in our data. Occlusion will also pose a major problem for us, as our best results were achieved by aggregating votes from the features over the entire object.

Another limitation of our approach is that we treat objects in isolation. Context is extremely important in helping with classification, especially when significant degrees of occlusion come into play. For example, if we can recognize some chairs in a scene, then a flat plane near the chairs would likely be a table.

We would expect that modelling these interactions between object categories would be extremely valuable for object recognition.

Lastly, in this work we showed that local features can be combined to capture discriminative properties of object categories. Taking this a step further would be to create a hierarchical representation of objects using features computed in a fashion similar to ours. Furthermore, one could envision a method in which local descriptors computed at keypoints are combined with representations of the smooth surfaces of an object to better capture the varying geometries of different object categories.

References

[1] Point feature histograms estimation documentation. Point Cloud Library.

[2] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.

[3] R. B. Rusu. Semantic 3D Object Maps for Everyday Manipulation in Human Living Environments. PhD thesis, Computer Science department, Technische Universitaet Muenchen, Germany, October 2009.

[4] R. B. Rusu and S. Cousins. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 2011.

[5] R. B. Rusu, Z. C. Marton, N. Blodow, and M. Beetz. Persistent point feature histograms for 3D point clouds. In Proc. 10th Int. Conf. on Intelligent Autonomous Systems (IAS-10), Baden-Baden, Germany, 2008.

[6] F. Tombari, S. Salti, and L. Di Stefano. Unique signatures of histograms for local surface description. In Computer Vision - ECCV 2010. Springer, 2010.

[7] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[8] J. Xiao, T. Fang, P. Zhao, M. Lhuillier, and L. Quan. Image-based street-side city modeling. In ACM Transactions on Graphics (TOG), volume 28, page 114. ACM, 2009.

[9] Y. Zhong. Intrinsic shape signatures: A shape descriptor for 3D object recognition. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on. IEEE, 2009.


Feature Based Registration - Image Alignment

Feature Based Registration - Image Alignment Feature Based Registration - Image Alignment Image Registration Image registration is the process of estimating an optimal transformation between two or more images. Many slides from Alexei Efros http://graphics.cs.cmu.edu/courses/15-463/2007_fall/463.html

More information

Using Geometric Blur for Point Correspondence

Using Geometric Blur for Point Correspondence 1 Using Geometric Blur for Point Correspondence Nisarg Vyas Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA Abstract In computer vision applications, point correspondence

More information

Ensemble of Bayesian Filters for Loop Closure Detection

Ensemble of Bayesian Filters for Loop Closure Detection Ensemble of Bayesian Filters for Loop Closure Detection Mohammad Omar Salameh, Azizi Abdullah, Shahnorbanun Sahran Pattern Recognition Research Group Center for Artificial Intelligence Faculty of Information

More information

The SIFT (Scale Invariant Feature

The SIFT (Scale Invariant Feature The SIFT (Scale Invariant Feature Transform) Detector and Descriptor developed by David Lowe University of British Columbia Initial paper ICCV 1999 Newer journal paper IJCV 2004 Review: Matt Brown s Canonical

More information

Scenario/Motivation. Introduction System Overview Surface Estimation Geometric Category and Model Applications Conclusions

Scenario/Motivation. Introduction System Overview Surface Estimation Geometric Category and Model Applications Conclusions Single View Categorization and Modelling Nico Blodow, Zoltan-Csaba Marton, Dejan Pangercic, Michael Beetz Intelligent Autonomous Systems Group Technische Universität München IROS 2010 Workshop on Defining

More information

Object and Class Recognition I:

Object and Class Recognition I: Object and Class Recognition I: Object Recognition Lectures 10 Sources ICCV 2005 short courses Li Fei-Fei (UIUC), Rob Fergus (Oxford-MIT), Antonio Torralba (MIT) http://people.csail.mit.edu/torralba/iccv2005

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

Wikipedia - Mysid

Wikipedia - Mysid Wikipedia - Mysid Erik Brynjolfsson, MIT Filtering Edges Corners Feature points Also called interest points, key points, etc. Often described as local features. Szeliski 4.1 Slides from Rick Szeliski,

More information

A NEW FEATURE BASED IMAGE REGISTRATION ALGORITHM INTRODUCTION

A NEW FEATURE BASED IMAGE REGISTRATION ALGORITHM INTRODUCTION A NEW FEATURE BASED IMAGE REGISTRATION ALGORITHM Karthik Krish Stuart Heinrich Wesley E. Snyder Halil Cakir Siamak Khorram North Carolina State University Raleigh, 27695 kkrish@ncsu.edu sbheinri@ncsu.edu

More information

Det De e t cting abnormal event n s Jaechul Kim

Det De e t cting abnormal event n s Jaechul Kim Detecting abnormal events Jaechul Kim Purpose Introduce general methodologies used in abnormality detection Deal with technical details of selected papers Abnormal events Easy to verify, but hard to describe

More information

Lecture 8 Object Descriptors

Lecture 8 Object Descriptors Lecture 8 Object Descriptors Azadeh Fakhrzadeh Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University 2 Reading instructions Chapter 11.1 11.4 in G-W Azadeh Fakhrzadeh

More information

IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES

IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES Pin-Syuan Huang, Jing-Yi Tsai, Yu-Fang Wang, and Chun-Yi Tsai Department of Computer Science and Information Engineering, National Taitung University,

More information

Prof. Feng Liu. Spring /26/2017

Prof. Feng Liu. Spring /26/2017 Prof. Feng Liu Spring 2017 http://www.cs.pdx.edu/~fliu/courses/cs510/ 04/26/2017 Last Time Re-lighting HDR 2 Today Panorama Overview Feature detection Mid-term project presentation Not real mid-term 6

More information

Chapter 4. Clustering Core Atoms by Location

Chapter 4. Clustering Core Atoms by Location Chapter 4. Clustering Core Atoms by Location In this chapter, a process for sampling core atoms in space is developed, so that the analytic techniques in section 3C can be applied to local collections

More information

Image matching. Announcements. Harder case. Even harder case. Project 1 Out today Help session at the end of class. by Diva Sian.

Image matching. Announcements. Harder case. Even harder case. Project 1 Out today Help session at the end of class. by Diva Sian. Announcements Project 1 Out today Help session at the end of class Image matching by Diva Sian by swashford Harder case Even harder case How the Afghan Girl was Identified by Her Iris Patterns Read the

More information

Image Features. Work on project 1. All is Vanity, by C. Allan Gilbert,

Image Features. Work on project 1. All is Vanity, by C. Allan Gilbert, Image Features Work on project 1 All is Vanity, by C. Allan Gilbert, 1873-1929 Feature extrac*on: Corners and blobs c Mo*va*on: Automa*c panoramas Credit: Ma9 Brown Why extract features? Mo*va*on: panorama

More information

Object Detection by 3D Aspectlets and Occlusion Reasoning

Object Detection by 3D Aspectlets and Occlusion Reasoning Object Detection by 3D Aspectlets and Occlusion Reasoning Yu Xiang University of Michigan Silvio Savarese Stanford University In the 4th International IEEE Workshop on 3D Representation and Recognition

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Edge and corner detection

Edge and corner detection Edge and corner detection Prof. Stricker Doz. G. Bleser Computer Vision: Object and People Tracking Goals Where is the information in an image? How is an object characterized? How can I find measurements

More information

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images 1 Introduction - Steve Chuang and Eric Shan - Determining object orientation in images is a well-established topic

More information

Learning 3D Part Detection from Sparsely Labeled Data: Supplemental Material

Learning 3D Part Detection from Sparsely Labeled Data: Supplemental Material Learning 3D Part Detection from Sparsely Labeled Data: Supplemental Material Ameesh Makadia Google New York, NY 10011 makadia@google.com Mehmet Ersin Yumer Carnegie Mellon University Pittsburgh, PA 15213

More information

String distance for automatic image classification

String distance for automatic image classification String distance for automatic image classification Nguyen Hong Thinh*, Le Vu Ha*, Barat Cecile** and Ducottet Christophe** *University of Engineering and Technology, Vietnam National University of HaNoi,

More information

Motion Estimation and Optical Flow Tracking

Motion Estimation and Optical Flow Tracking Image Matching Image Retrieval Object Recognition Motion Estimation and Optical Flow Tracking Example: Mosiacing (Panorama) M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003 Example 3D Reconstruction

More information

Removing Shadows from Images

Removing Shadows from Images Removing Shadows from Images Zeinab Sadeghipour Kermani School of Computing Science Simon Fraser University Burnaby, BC, V5A 1S6 Mark S. Drew School of Computing Science Simon Fraser University Burnaby,

More information

3D model classification using convolutional neural network

3D model classification using convolutional neural network 3D model classification using convolutional neural network JunYoung Gwak Stanford jgwak@cs.stanford.edu Abstract Our goal is to classify 3D models directly using convolutional neural network. Most of existing

More information

Image processing and features

Image processing and features Image processing and features Gabriele Bleser gabriele.bleser@dfki.de Thanks to Harald Wuest, Folker Wientapper and Marc Pollefeys Introduction Previous lectures: geometry Pose estimation Epipolar geometry

More information

CS233: The Shape of Data Handout # 3 Geometric and Topological Data Analysis Stanford University Wednesday, 9 May 2018

CS233: The Shape of Data Handout # 3 Geometric and Topological Data Analysis Stanford University Wednesday, 9 May 2018 CS233: The Shape of Data Handout # 3 Geometric and Topological Data Analysis Stanford University Wednesday, 9 May 2018 Homework #3 v4: Shape correspondences, shape matching, multi-way alignments. [100

More information

SCALE INVARIANT FEATURE TRANSFORM (SIFT)

SCALE INVARIANT FEATURE TRANSFORM (SIFT) 1 SCALE INVARIANT FEATURE TRANSFORM (SIFT) OUTLINE SIFT Background SIFT Extraction Application in Content Based Image Search Conclusion 2 SIFT BACKGROUND Scale-invariant feature transform SIFT: to detect

More information

CS229: Action Recognition in Tennis

CS229: Action Recognition in Tennis CS229: Action Recognition in Tennis Aman Sikka Stanford University Stanford, CA 94305 Rajbir Kataria Stanford University Stanford, CA 94305 asikka@stanford.edu rkataria@stanford.edu 1. Motivation As active

More information

Object Category Detection. Slides mostly from Derek Hoiem

Object Category Detection. Slides mostly from Derek Hoiem Object Category Detection Slides mostly from Derek Hoiem Today s class: Object Category Detection Overview of object category detection Statistical template matching with sliding window Part-based Models

More information

3D Keypoints Detection for Objects Recognition

3D Keypoints Detection for Objects Recognition 3D Keypoints Detection for Objects Recognition Ayet Shaiek 1, and Fabien Moutarde 1 1 Robotics laboratory (CAOR) Mines ParisTech 60 Bd St Michel, F-75006 Paris, France Abstract - In this paper, we propose

More information