Scalable Object Classification in Range Images

Eunyoung Kim and Gerard Medioni
Institute for Robotics and Intelligent Systems, USC Viterbi School of Engineering
University of Southern California, Los Angeles, CA, USA

Abstract: We present a novel scalable framework for free-form object classification in range images. The framework comprises an automatic 3D object recognition system for range images and a scalable database structure that learns new instances and new categories efficiently. We adopt the TAX model, previously proposed for unsupervised object modeling in 2D images, to construct a hierarchical model of object classes from unlabelled range images. The hierarchical model embodies unorganized shape patterns of 3D objects of various classes in a tree structure with probabilistic distributions. A new visual vocabulary is introduced to represent a range image as a set of visual words for hierarchical model inference, classification and online learning. We also propose an online learning algorithm that, thanks to the tree structure, updates the hierarchical model efficiently whenever a new object must be learned. Extensive experiments demonstrate average classification rates of 94% on a large synthetic dataset (1,350 training images and 450 test images over 9 object classes) and 88.4% on 1,433 depth images captured with real-time range sensors. We also show that our approach outperforms the original TAX method in terms of recall rate and stability.

Keywords: object classification; range images; scalable data structure

I. INTRODUCTION

Shape-based object classification in range images aims to label the objects captured in a range image according to the common patterns they share with the other objects of the same class. Its applications include autonomous robotic navigation and manipulation, and urban scene understanding. Object classification using complete 3D models has been studied actively for content-based shape retrieval [1]. Categorizing objects in range images, however, poses an additional challenge: unlike 3D models, range images provide irregularly sampled 3D points covering only the visible surfaces of objects, and range images of the same object from different views can differ greatly. A classification method for range images should therefore be tolerant to intra-class shape variance (e.g. partial views), yet strict about distinctive inter-class shape differences. The database should also have an expandable structure, so that novel shape patterns of object classes unseen during training can be learned. We thus present a scalable object classification framework that aims to categorize 3D objects captured in range images efficiently and to update the database of object classes whenever previously unseen data is detected in the scene.

Figure 1. System overview: the red box is the focus of this paper

Fig. 1 outlines the proposed framework. Our system first segments object candidates (i.e. point clouds) from the scene and then identifies the class label of each candidate. This design rests on two observations: 1) our target applications are robotic manipulation and urban scene understanding in LIDAR images, where objects usually rest on planar surfaces (e.g. the ground) in non-cluttered environments, and 2) segmentation is inevitable if new instances are to be learned in an unsupervised manner. Object segmentation also improves run-time performance: the Spin Image [2] and Tensor matching [3] methods took 2 hrs and 90 sec per object, respectively [4], while our method takes less than 2 sec per object, including the object segmentation step.
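For concreteness, the following minimal Python sketch mirrors the control flow of Fig. 1. It is an illustration only: the three helpers are hypothetical stand-ins for the components described in Sections III, V and VI, not the system's actual interface.

    def segment_candidates(range_image):
        """Sect. III: plane fitting + clustering of resting points (stub)."""
        raise NotImplementedError

    def classify(cloud, hsd):
        """Sect. V: label inference against the hierarchical database (stub)."""
        raise NotImplementedError

    def online_update(hsd, cloud):
        """Sect. VI: local update of the hierarchical database (stub)."""
        raise NotImplementedError

    def process_frame(range_image, hsd):
        """Top-level loop of Fig. 1: segment, classify, learn if unseen."""
        for cloud in segment_candidates(range_image):
            label, is_new = classify(cloud, hsd)
            if is_new:                     # new instance or new class detected
                online_update(hsd, cloud)  # expand the database on the fly
            yield cloud, label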
The focus of our scalable classification framework is on 1) a hierarchically structured database, 2) an object classifier, and 3) an online learning process.

1) Hierarchically structured database (Sect. IV): For fast label inference and online learning, we construct a hierarchically structured database. The TAX model [5] is adopted to build the hierarchical model of object classes from unlabelled range images, by mapping each image to a path of the tree composed of L nodes. We also propose a new visual vocabulary as the object representation (Sect. III), which exploits the spatial context between a supporting planar surface and an object.

2) Object classifier (Sect. V): After segmentation, given the visible surface of a 3D object, our classifier identifies its class label. We improve on the labeling process of [5]: our method computes a normalized shape similarity between objects and imposes local patterns between visual words, which are disregarded during training.

3) Online learning (Sect. VI): We introduce an online learning approach for the TAX model, with a discussion of new-instance and new-class learning.

The outline of the paper is as follows. Section II summarizes related work and our contribution. Section III introduces the new visual vocabulary, and Section IV briefly reviews the TAX model. The proposed classification and online learning processes are detailed in Section V and Section VI, respectively. Experimental results are shown in Section VII, followed by concluding remarks.

II. RELATED WORK

There are two main approaches to shape-based object classification: global description and local description. Global description methods have been popular for content-based shape retrieval [1], [6], [7], but they are not suitable for object classification in range images, which capture noisy partial surfaces of objects. The most popular approach is therefore to extract local 3D shape descriptors from the point cloud and to recognize objects in the image by matching the descriptors against those of known objects. Splash [8] captures the distribution of orientations around a reference point, and the Spin image [2] represents, for a reference point p, a 2D histogram of the (α, β) coordinates of its neighbors, where the (α, β) coordinates span around the orientation of p. Spherical spin images [9], normal-based signatures [10], tensor-based representations [3] and point pair features [4] have also been proposed. However, these works mainly aim to recognize and localize objects whose exact shape is already stored in the database. For object classification, part-based approaches have mainly been used. Huber et al. [11] identify the class of a given object by inferring parts from similar local shape descriptors (spin images), grouping the parts of all objects into part classes, and mapping part classes to object classes. The method in [12] also groups similar local shape descriptors to form a set of shape-class components and uses surface signatures that encode the spatial relationships between the components. [13] combines spin images with other contextual features to classify objects. It is worth noting that the spin image, which most of these methods rely on, is very sensitive to the resolution of the 3D points and very slow to compute. Our contributions over these methods are as follows. We exploit the spatial context between objects and a supporting planar surface, which provides a more stable orientation and a global description of objects (e.g. height); note that noisy and unreliable depth measurements from range sensors lead to inaccurate surface orientation estimates and may weaken the performance of local shape descriptors. These works also rarely discuss how to handle a large number of range images for efficient label inference when the database must grow gradually, and they require batch re-learning every time new instances are added.
Finally, we address the shortcomings of the TAX model and suggest a method to improve its performance.

III. OBJECT REPRESENTATION IN A VISUAL VOCABULARY

The basic descriptor of the TAX model is the visual word, so every range image must be represented as a set of visual words. We design a new visual vocabulary that uses the spatial context with the ground surface as a global description, based on the observation that objects rest on planar surfaces in many cases: objects in urban areas sit on the ground surface and on the roofs and walls of buildings, and objects in indoor scenes also rest on planar surfaces for stability. Supporting planar surfaces save computation, by bounding the 3D points representing the objects of interest, and define the major orientation of objects in the scene. Much work exists on extracting planar surfaces from range images [14], [15]. In our tests, we fit planar patches using RANSAC and extract planar surfaces by grouping consistent patches. For each planar surface, we segment point clusters on the surface using the surface pose and the adjacency of points; each cluster is considered an object candidate. To compute visual words, every candidate is transformed into the ground-surface coordinate system. This makes our visual words invariant to rotation around the Y-axis (which corresponds to height above the surface), as each object is free to rotate about the Y-axis in the ground coordinate system. Note that object segmentation in heavily cluttered environments is beyond our scope.

Interest point sampling: Since a segmented point cloud usually contains thousands of unorganized 3D points, we uniformly sample interest points from it and encode them into visual words. The interest points are sampled according to surface saliency, which avoids points close to object boundaries and noisy 3D points. Surface saliency is determined by Tensor Voting (TV) [16]: given points with noise and inaccurate depth, the TV process infers a surface saliency and a more reliable surface orientation for every input point. After the TV process, the point with the highest saliency is selected as an interest point, and all of its neighbors within a certain radius are discarded from the list. This sampling is applied iteratively to the remaining points until none are left. The sampling radius depends on the resolution of the 3D points; in our experiments it was set to 1 cm.
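To make the sampling loop concrete, here is a minimal NumPy sketch. It assumes the per-point saliency has already been computed (e.g. by Tensor Voting, which is not reproduced here); the 1 cm radius matches the setting above.

    import numpy as np

    def sample_interest_points(points, saliency, radius=0.01):
        """Greedy saliency-driven sampling, as described above: repeatedly
        keep the most salient remaining point and drop every point within
        `radius` of it.
        points  : (N, 3) array in metres (ground coordinate system)
        saliency: (N,) surface saliency, assumed precomputed by Tensor Voting
        returns : indices of the selected interest points"""
        order = np.argsort(-saliency)              # most salient first
        alive = np.ones(len(points), dtype=bool)
        picked = []
        for i in order:
            if alive[i]:
                picked.append(i)
                # discard the point and all neighbours inside the radius
                d2 = np.sum((points - points[i]) ** 2, axis=1)
                alive &= d2 > radius ** 2
        return np.array(picked)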

Figure 2. Visual word assignment: (a) global descriptor; (b) six regions; (c) local descriptor

Descriptor computation: The next step is to encode every interest point into a visual word. Each visual word $w$ is an 8-dimensional integer vector ($w \in \mathbb{Z}^8$). The first two coordinates carry global information about the point (Fig. 2(a)), and the remaining six coordinates carry its local description (Fig. 2(c)). Given a 3D point $p_i = (x_i, y_i, z_i)$ with orientation $n_i = (n^i_x, n^i_y, n^i_z)$ (in the ground coordinate system), $n^i_y$ determines the value of the first coordinate, which indicates the type of surface the point $p_i$ lies on:

$$w[1] = \begin{cases} 0, & n^i_y \ge \cos(\pi/6) \\ 1, & \cos(\pi/3) \le n^i_y < \cos(\pi/6) \\ 2, & n^i_y < \cos(\pi/3) \end{cases}$$

The second coordinate encodes the height $y_i$ of the point above the ground. The object is divided into $\gamma$ regions in terms of height, and $\lfloor y_i\,\gamma / h_{\max} \rfloor$ is assigned to the second coordinate (see Fig. 2(a)). In our experiments, $\gamma$ was set to 10; an additional value is reserved for noisy points much higher than the objects we expect, so the second coordinate can take 11 different values in total. The remaining six coordinates capture the surface smoothness at the point. We align the cylinder of Fig. 2(b) to the point in order to partition its neighbors into six regions, as illustrated in Fig. 2(c). Then, for every region $r$, we compute the average orientation similarity $\theta_r = \frac{1}{|N_r|} \sum_{j \in N_r} (n_i \cdot n_j)$, where $N_r$ is the set of neighboring points in the region. Finally, the corresponding coordinate encodes the degree of surface smoothness:

$$\begin{cases} 0 \text{ (smooth)}, & \cos(\pi/6) \le \theta_r \\ 1 \text{ (weakly smooth)}, & \cos(\pi/3) \le \theta_r < \cos(\pi/6) \\ 2 \text{ (not smooth)}, & \theta_r < \cos(\pi/3) \\ 3, & \text{if the region contains no points} \end{cases}$$

For example, the point $p_i$ shown in Fig. 2(c) has no neighbors in regions A, C, D and F but smooth surface in regions B and E, so its visual word has 3s and 0s in the corresponding coordinates, respectively. As a result, our visual vocabulary has 135,168 $(= 3 \times 11 \times 4^6)$ visual words. Every interest point $d$ in image $i$ has a corresponding visual word $w_{i,d}$. Our visual words are deliberately coarse, to handle the shape variance induced by partial views and inaccurate orientations; discriminative descriptions of object classes come from the co-occurrence patterns of visual words.
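The encoding itself reduces to a few threshold tests, as in the sketch below, which follows the definitions above. One caveat: the text does not fully specify the six-region cylindrical partition of Fig. 2(b), so the region assignment here (two height halves times three angular sectors) is an illustrative guess.

    import numpy as np

    COS30, COS60 = np.cos(np.pi / 6), np.cos(np.pi / 3)

    def smoothness_code(theta):
        """Quantize an average orientation similarity into the 4 codes above."""
        return 0 if theta >= COS30 else (1 if theta >= COS60 else 2)

    def visual_word(p, n, neigh_p, neigh_n, h_max, gamma=10):
        """8-dim integer visual word of point p (unit normal n), given the
        neighbours (and their unit normals) inside the aligned cylinder."""
        w = np.zeros(8, dtype=int)
        # coordinate 1: surface type from the vertical normal component n_y
        w[0] = 0 if n[1] >= COS30 else (1 if n[1] >= COS60 else 2)
        # coordinate 2: quantized height above the ground (bin 10 = too high)
        w[1] = min(int(p[1] * gamma / h_max), gamma)
        # coordinates 3-8: smoothness code per region; 3 marks empty regions.
        rel = neigh_p - p
        sector = np.floor((np.arctan2(rel[:, 2], rel[:, 0]) + np.pi)
                          / (2 * np.pi / 3)).astype(int) % 3
        region = 3 * (rel[:, 1] >= 0).astype(int) + sector  # guessed partition
        for r in range(6):
            mask = region == r
            w[2 + r] = smoothness_code(float(np.mean(neigh_n[mask] @ n))) \
                       if mask.any() else 3
        return w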
IV. HIERARCHICAL MODEL OF OBJECT CLASSES

To build a hierarchical model of object classes, we adopt the generative model TAX, previously proposed for learning a taxonomy from a collection of unlabelled images [5]. The intuition behind this hierarchical model is to group range images with similar shape patterns into the same path, where each pattern is characterized by probabilistic models of the co-occurrence frequencies of visual words. We select the TAX model as our database structure because its tree structure enables efficient label inference and online learning. Moreover, generative models perform better with small training sets [17], and our application requires learning a few instances of a new class incrementally. This section gives a brief review of the TAX model.

The TAX model is a hierarchical variant of the Latent Dirichlet Allocation (LDA) model [18]. LDA was introduced for the unsupervised discovery of topics for document classification and is also widely used for image retrieval. While LDA has a flat structure, TAX maps each document to one of the paths of a tree; the height L of the tree must be given.

Figure 3. Graphical model of the TAX

Fig. 3 depicts the graphical model of TAX; the complete generative process is:

$$\begin{aligned} \mathrm{Tree} &\sim \mathrm{nCRP}(\gamma), & l_{i,d} &\sim \mathrm{Mult}(1/L),\\ \pi_c &\sim \mathrm{Dir}(\alpha), & \phi_t &\sim \mathrm{Dir}(\beta),\\ z_{i,d} &\sim \mathrm{Mult}(\pi_{\psi_{i,l_{i,d}}}), & w_{i,d} &\sim \mathrm{Mult}(\phi_{z_{i,d}}). \end{aligned}$$

Intuitively, in this generative model each range image i is assigned to a path ψ_i in the tree. The path is chosen under a nested Chinese restaurant process prior, nCRP(γ), where γ is a parameter controlling the branching probability [19]; the path can be either one of the existing paths in the tree or a new path split off an existing internal node. Every node c has a multinomial distribution π_c over topics, and each topic t is in turn modeled by a multinomial distribution φ_t over visual words. Both π_c and φ_t are drawn from Dirichlet distributions with hyperparameters α and β, respectively, which control the relative sparsity of these distributions. For every interest point d in image i, ψ_i is the path the image is assigned to, and z_{i,d} and l_{i,d} are the topic and level assignments of the interest point. These are the hidden variables that must be inferred from the observed interest points (i.e. the w_{i,d}) during training. As computing the exact posterior distribution of these latent variables given the observations is intractable, an approximation method, Gibbs sampling, is used: it approximates the posterior by iteratively drawing samples of z_{i,d}, l_{i,d} and ψ_i from their conditional distributions given the other hidden variables and the observations, for instance p(z_{i,d} = z | L, Ψ, W, α, β, γ), where L and Ψ are the previous level and path assignments and W denotes the observed visual words; p(l_{i,d} = l | rest) and p(ψ_i = ψ | rest) are defined similarly. More details are available in [5]. After Gibbs sampling, we finally obtain the hierarchically structured database (HSD).

Figure 4. Overview of our classification process

V. OBJECT CLASSIFICATION

After HSD inference, every training image belongs to one of the complete paths of the tree, and the multinomial distribution $\hat\phi_t$ of every topic t and the distribution $\hat\pi_c$ at every node c are estimated from the topic and level assignments:

$$\hat\phi_{t,w} = \frac{\beta + N_{t,w}}{\beta W + N_{t,\cdot}}, \qquad \hat\pi_{c,t} = \frac{\alpha + N_{c,t}}{\alpha T + N_{c,\cdot}},$$

where N_{t,w} is the number of interest points with visual word w assigned to topic t, N_{t,·} is the number of interest points assigned to topic t, N_{c,t} is the number of interest points assigned to topic t at node c, N_{c,·} is the total number of interest points assigned to node c, and W and T are the numbers of visual words and topics. When a new image j is given, the probability of observing image j given a path ψ is

$$p(j \mid \psi) = \prod_d \sum_{l,t} \hat\phi_{t,w_{j,d}}\, \hat\pi_{(\psi,l),t},$$

where w_{j,d} is the visual word of interest point d in image j and (ψ, l) denotes the node at level l of path ψ. Fig. 4 depicts our approach to object classification using the probability p(j | ψ). We first describe a naïve approach derived from the original TAX model [5]. To address its drawbacks, we then propose a novel visual vocabulary, Pattern from Neighbors (PfN), which enforces discriminative local patterns of visual words and allows a normalized shape similarity between objects to be computed.

The naïve approach is simple: it computes p(j | ψ) for every existing path ψ and keeps the σ paths with the highest similarity. Let Ψ denote this set of σ paths; Fig. 4 shows an example (red lines, σ = 2). The test image is then labeled by majority voting over the objects under the paths in Ψ. In our experiments, σ was always set to 3. Unfortunately, this approach often exhibits very poor and unstable performance, e.g. all cups labeled as bottles: if some object classes share similar shapes (e.g. the red and blue regions in Fig. 5) and the distributions at the nodes are largely inferred from those shapes, images from different classes can be led to the same path.

Figure 5. PfN visual word inference from original words: (a) cup; (b) bottle. The cylinder (red) of Fig. 2(b) is aligned to every interest point.
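As a minimal sketch of this naïve scoring step, assuming the count matrices N_{t,w} and N_{c,t} accumulated during Gibbs sampling are available as NumPy arrays (and working in log space to avoid underflow on the product over points):

    import numpy as np

    def estimate_distributions(N_tw, N_ct, alpha, beta):
        """Smoothed estimates of phi_hat (topics x words) and pi_hat
        (nodes x topics) from the Gibbs-sampling counts, as defined above."""
        W, T = N_tw.shape[1], N_ct.shape[1]
        phi = (beta + N_tw) / (beta * W + N_tw.sum(1, keepdims=True))
        pi = (alpha + N_ct) / (alpha * T + N_ct.sum(1, keepdims=True))
        return phi, pi

    def log_p_given_path(words, path_nodes, phi, pi):
        """log p(j | psi): for every interest point's word, marginalize the
        joint (level, topic) weight along the path, then sum the logs."""
        per_point = pi[path_nodes] @ phi[:, words]   # (L, D): level x point
        return float(np.log(per_point.sum(0)).sum())

    def top_paths(words, paths, phi, pi, sigma=3):
        """Naive classifier: indices of the sigma highest-scoring paths,
        whose training objects then vote on the label."""
        scores = [log_p_given_path(words, nodes, phi, pi) for nodes in paths]
        return sorted(range(len(paths)), key=lambda k: -scores[k])[:sigma]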
A. Proposed approach using the PfN vocabulary

Our idea is to reduce the ambiguity in the path by enforcing shape similarity between the test object j and the objects under the paths in Ψ. Given the set of images I under a path ψ ∈ Ψ, we compute a normalized shape similarity sim(i, j) for every i ∈ I and use it as a weight in the majority voting process, whereas the naïve approach gives equal weight to all images in I. Motivated by the observation that the ambiguity arises when the discriminative patterns of the objects (e.g. the green regions in Fig. 5) are not properly captured in the HSD, we characterize an object by a distribution over the frequencies of all its visual words and define sim(i, j) as a distance between the distributions of objects i and j. However, a distribution over the original visual vocabulary would be too sparse, given its large size (135,168 words). We therefore design a PfN visual vocabulary that captures patterns of local visual words.

A PfN visual word encodes the distance between two visual words with respect to their surface smoothness patterns. It is a 7-dimensional integer vector: given two visual words w_t and w_n, the PfN visual word f_tn is defined as

f_tn[0] = tid, f_tn[1] = h(w_t[3], w_n[3]), …, f_tn[6] = h(w_t[8], w_n[8]),

where h(x, y) is the Hamming distance between x and y, and w[x] denotes the x-th coordinate of visual word w. The first coordinate, tid, is a cluster label that links the PfN word back to the original visual word w_t: since the original vocabulary is too large, we cluster the visual words under the same path and use the cluster label as the tid of each visual word. In total, the PfN vocabulary has 576 (= 9 × 2^6) visual words (9 clusters). For every w_{i,d}, we compute all possible PfN visual words with its neighbors within a certain distance τ_d; for instance, the interest points in the blue region in Fig. 5 infer one set of PfN words from the neighbors in the same region and another from the neighbors in the red region. We then compute the distribution F_i over the PfN visual words, which represents the frequency of the words in object i: $F_{i,f_P} = N_{i,f_P} / N_{i,\cdot}$, where $N_{i,f_P}$ is the number of occurrences of the PfN visual word $f_P$ and $N_{i,\cdot}$ is the total number of PfN visual words inferred from object i. Finally, sim(i, j) is defined as the Bhattacharyya coefficient of F_i and F_j, so 0 ≤ sim(i, j) ≤ 1.
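A sketch of how the PfN histograms and the resulting weighted vote could be computed follows. One interpretive choice is worth flagging: treating h() as a 0/1 mismatch indicator per smoothness coordinate is what reproduces the quoted vocabulary size of 9 × 2^6 = 576, but the text does not spell this out, and the tolerance value tau_n below is likewise a placeholder.

    import numpy as np
    from collections import Counter

    def pfn_word(tid, w_t, w_n):
        """PfN word of a visual-word pair: the cluster id of w_t plus a 0/1
        mismatch flag per smoothness coordinate (assumed reading of h)."""
        return (tid,) + tuple(int(w_t[k] != w_n[k]) for k in range(2, 8))

    def pfn_histogram(pfn_words):
        """Normalized frequency F_i over the PfN words of one object."""
        c = Counter(pfn_words)
        total = sum(c.values())
        return {f: n / total for f, n in c.items()}

    def bhattacharyya(F_i, F_j):
        """sim(i, j) = sum_f sqrt(F_i[f] * F_j[f]), in [0, 1]."""
        return sum(np.sqrt(F_i[f] * F_j[f]) for f in F_i.keys() & F_j.keys())

    def weighted_vote(F_test, candidates, tau_n=0.1):
        """Similarity-weighted vote over (label, F_i) pairs; images with
        sim < tau_n give no support, and no support at all means the test
        object is declared a new class (tau_n is a placeholder value)."""
        votes = {}
        for label, F_i in candidates:
            s = bhattacharyya(F_i, F_test)
            if s >= tau_n:
                votes[label] = votes.get(label, 0.0) + s
        return max(votes, key=votes.get) if votes else "new-class"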

B. Classification process

Given a new image j, we first identify the set of paths Ψ whose shape patterns are most similar to image j. Then, for every path ψ ∈ Ψ:

Step 1: Let I be the objects under path ψ and V_I all the visual words of those objects. To assign a cluster id (tid), we initially group the visual words in V_I into three clusters according to orientation (i.e. w[1]), and then apply k-means clustering within each initial cluster to group the words by surface smoothness (i.e. w[3] to w[8]). In our experiments, k is set to 3. Every visual word in V_I thus receives a cluster label.

Step 2: Using the estimated cluster properties (center, dimensions), assign a cluster label to every visual word in V_j as well. Unless at least half of the visual words in V_j are coupled with the clusters from V_I, every object in I is discarded from the voting process.

Step 3: Compute sim(i, j) for every image i in I.

Step 4: Finally, label the test object j with the object class that receives the majority of the weighted votes (e.g. the red box in Fig. 4).

Extensive experiments (Table I in Sect. VII) demonstrate that our approach using the PfN vocabulary gives more stable and better labeling performance.

Range images may contain objects not present in the database, and we threshold the shape similarity sim(i, j) to identify them: during classification, if sim(i, j) is lower than a tolerance τ_n, the test image j receives no support from image i, and if the test image is supported by none of the training images under the paths in Ψ, it is labeled as a new class.

VI. SCALABLE APPROACH: ONLINE LEARNING

Sect. IV describes the batch learning process that infers an HSD from an existing dataset. In many applications, however, the HSD must grow over time, and it is infeasible to re-run the batch algorithm every time a new range image is added. We therefore propose an online learning algorithm that incrementally updates the existing HSD given a new range image. As recent related work, [20] discusses several approaches to online inference for the LDA, which has a flat structure. Algorithm 1 outlines our online learning algorithm: given an initial HSD trained by the batch algorithm, the database efficiently learns a new object i through a local update, thanks to the tree structure. M_i denotes the number of interest points in the new image i.

Algorithm 1 Pseudocode of the online learning process
 1: Determine the path ψ_i
 2: for d = 1, …, M_i do
 3:     Sample z_{i,d} using p(z_{i,d} = z | rest)
 4:     Sample l_{i,d} using p(l_{i,d} = l | rest)
 5: end for
 6: for j in R(i) do
 7:     for d = 1, …, M_j do
 8:         Sample z_{j,d} using p(z_{j,d} = z | rest)
 9:         Sample l_{j,d} using p(l_{j,d} = l | rest)
10:     end for
11: end for
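In Python, the update of Algorithm 1 has roughly the following shape. This is a sketch only: sample_z and sample_l stand for the same conditional Gibbs draws p(z = z | rest) and p(l = l | rest) used in batch training and are left abstract, and the hsd object with its assignment tables is an assumed interface.

    def online_update(hsd, i, psi_i, R, sample_z, sample_l):
        """Local HSD update of Algorithm 1. hsd.z / hsd.l hold per-point
        topic and level assignments; i is the new image id, psi_i its chosen
        path, and R the set of existing images to refresh."""
        hsd.assign_path(i, psi_i)                  # line 1: path assignment
        for d in range(hsd.num_points(i)):         # lines 2-5: new image
            hsd.z[i, d] = sample_z(hsd, i, d)
            hsd.l[i, d] = sample_l(hsd, i, d)
        for j in R:                                # lines 6-10: resample old
            for d in range(hsd.num_points(j)):     # images on nearby paths
                hsd.z[j, d] = sample_z(hsd, j, d)
                hsd.l[j, d] = sample_l(hsd, j, d)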
Unlike batch learning, where the topic, level and path assignments are sampled iteratively in turn, our online learning process for a local update starts by choosing the path ψ_i to which image i should belong, and then assigns topic and level variables to every interest point of image i by Gibbs sampling, exactly as in batch learning (lines 2-5 of Algorithm 1). The path ψ_i is either one of the existing paths or a new path; the path assignment process is discussed below. We then also resample the previous topic and level assignments of the existing images R(i) associated with the path ψ_i (lines 6-10 of Algorithm 1), in order to update the distributions related to that path by conditioning not only on the existing training data but also on the new object. We explored several approaches to selecting R(i), the set of old images to be updated for the path ψ_i; based on comparative experiments, we construct R(i) by randomly selecting images from each path that shares the deepest branching-off internal node with ψ_i. In our experiments, if the test object is a new object, the database is updated incrementally on the fly, and the updated database is used for subsequent classification. For efficiency, our online learning process expands the existing hierarchical database only when there is a new instance of an existing class or a new class. A new class is detected directly by the classification process; to identify a new instance of an existing class, we use the path assignment process, as follows.

Path assignment for a new object class: A new object class should be assigned to a new path in the tree, and p(ψ_i = ψ | rest) (Sect. IV) measures how likely object i is to belong to a new path ψ. We iteratively compute p(ψ_i = ψ | rest) and sample a path from it. Computing this probability for every possible new path in the tree would be inefficient, however, since p(ψ_i = ψ | rest) requires heavy gamma-function computations. We therefore restrict sampling to a candidate set Ψ_c. For a new object class, Ψ_c is the set Ψ_new of new paths (green lines in Fig. 6) that share nodes with the paths in Ψ (red lines), i.e. Ψ_c = Ψ_new; recall that Ψ is produced by the classification process. After iterative path sampling over Ψ_c, image i is learned under the sampled path of the HSD with a new label, through the process described in Algorithm 1.
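The restricted sampling step itself is a categorical draw. A small sketch, assuming the unnormalized log-scores log p(ψ_i = ψ | rest) have already been computed as a NumPy array, one per candidate path in Ψ_c:

    import numpy as np

    def sample_candidate_path(candidates, log_scores, rng=None):
        """Draw one path from the candidate set Psi_c in proportion to
        p(psi_i = psi | rest), given unnormalized log-scores."""
        rng = rng or np.random.default_rng()
        p = np.exp(log_scores - np.max(log_scores))  # stable normalization
        p /= p.sum()
        return candidates[rng.choice(len(candidates), p=p)]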

Figure 6. Path candidates. Black lines: existing paths; red lines: selected existing paths with high similarity to the given new image; green lines: new path candidates

Online learning for a new instance of an existing class: Even when the test object i is labeled as an existing class, the existing HSD might not capture the object properly; in our framework it is then considered a new instance of the class and should be learned into the HSD for subsequent classification. To verify whether the object is fully modeled by the existing HSD, the path assignment process is again exploited. In this case, we construct the candidate set Ψ_c to contain both the paths in Ψ (red lines in Fig. 6) and the new paths sharing nodes with the paths in Ψ (green lines), i.e. Ψ_c = Ψ_new ∪ Ψ. After iterative path sampling, unless the final sampled path is one of the existing paths of the HSD, the object i is considered a new instance and inserted into the HSD under that path by our online learning algorithm, since no existing path characterizes the object well.

VII. EXPERIMENTAL RESULTS

To acquire a large number of range images for extensive evaluation, we used 3D models freely available on the web, as it is hard to build a dataset containing range images of varied objects using only real range sensors. We randomly selected variously shaped 3D models for 9 object classes (bottle, car, chair, mug cup, desk lamp, lamp, monitor, phone and plane; 15 models per class) as the training dataset for HSD inference (Fig. 7). For every 3D model we generated ten range images at random viewpoints, all completely unlabeled. All the HSDs used in our experiments were trained from these 1,350 synthetic range images through the process described in Sect. IV. The hyperparameters α, β, γ of the Dirichlet distributions were always set to 1, 0.01 and 1, respectively, and Gibbs sampling was run for 200 iterations. Various [L, T] pairs were evaluated; due to space limits, only the results of ten HSDs learned with [3, 30] are presented.

Figure 7. 3D models in the training dataset

[Table I. Comparison of classification performance (unit: %): per-class average (AVG) and standard deviation (SD) of the recall rate for the naïve approach and for ours with PfN. The numeric entries were lost in transcription.]

A. Evaluation on synthetic data

We generated 450 range images (50 per class) from 3D models different from those in the training dataset, also rendered at random views, and validated our system on these images.

Classification performance: The first experiment compares the naïve approach with ours on the test images. For an extensive comparison, we ran both approaches on 168 different HSDs learned with various [L, T] pairs. The resulting averages and standard deviations of the recall rates are given in Table I, which shows that our approach both improves and stabilizes classification performance: on average, the recall and precision rates increase by 16% and 14%, respectively. Finally, the confusion matrix in Table II shows the classification performance of our system using the PfN vocabulary; the average recall rate is 94%.

Incremental learning performance: We also validated our incremental learning method in terms of labeling performance. We first infer an initial HSD by applying the batch learning process to randomly selected training images, and then learn the remaining images with our online learning process. Let {a, b} denote the numbers of images used in initial batch learning (a) and in online learning (b). As the quality of the final HSD is closely tied to the shape homogeneity within each path, we assess it by how well the naïve approach categorizes the test images. Fig. 8 shows the labeling performance of the HSDs trained by our online inference algorithm for different {a, b} pairs; each bar is a mean recall rate, and the stick overlaid on the bar indicates the standard deviation.

Computation time: Our batch learning took 11.7 hrs on average. Our labeling process is extremely fast.

It took 16 to 350 ms per image in our experiments, on a system with 3.0 GHz CPUs and 8 GB of RAM.

[Table II. Confusion matrix on the synthetic test set (unit: %; L = 3, T = 30). The numeric entries were garbled in transcription.]

Figure 8. Online learning performance: red: {90, 1260}; green: {450, 900}; purple: {900, 450}; blue: {1350, 0}

Figure 9. Experimental environments: (a) the SwissRanger SR3000; (b) test objects

B. Evaluation on real range images

Our scalable classification system was also validated on depth images acquired with a real-time range sensor, the SwissRanger SR3000 (Fig. 9(a)), a time-of-flight sensor that produces low-resolution depth images in real time at 25 fps. Because of the low resolution, our test objects (Fig. 9(b)) must be small enough to rest on the plate and be captured very close to the sensor. For every image, we first segment object candidates (red in Fig. 10) by identifying the supporting planar surface (yellow), and then recognize each candidate's label with our method. The path assignment process is then applied, either to identify the new path to which the object will be assigned (if the object is classified as a new class) or to verify whether the object is a new instance of an existing class. The resulting paths are accumulated until the object is no longer visible in the scene. When the object disappears, if it was classified as a new instance or a new class, the online learning process (Algorithm 1) inserts it under the new path with the highest number of votes.

Fig. 10 shows an example of the HSD inferred after 1,433 depth images were processed. It demonstrates three different cases: the test object is classified as (1) an existing instance of an existing class (mug cup/paper cup/bottle/phone, black lines), (2) a new instance of an existing class (espresso cup, blue line), or (3) a new instance of a new class (duck, red line). When the duck first appears in the scene, it is labeled as a new class and learned into the HSD; when it appears again later, it is recognized as a duck. The confusion matrix for this dataset is given in Table III; the average correct classification rate on the existing classes is about 88.4%. The attached supplementary video displays the segmentation and identification results, together with the computation time for every image.

VIII. CONCLUSION

We have presented a scalable framework that categorizes 3D objects in range images and expands to handle new data. For fast labeling and online inference, we employ a hierarchical model of object classes whose tree structure and distributions are inferred automatically from range images in an unsupervised manner. Our labeling approach using the PfN visual vocabulary improves performance, and the online inference process recognizes the path corresponding to new data and updates only the part of the tree associated with that path.

ACKNOWLEDGMENT

This work is supported by DARPA under the URGENT program. The content of this paper is approved for public release, distribution unlimited.

REFERENCES

[1] T. Funkhouser, P. Min, M. Kazhdan, J. Chen, A. Halderman, D. Dobkin, and D. Jacobs, "A search engine for 3D models," ACM ToG, vol. 22, no. 1, 2003.
[2] A. E. Johnson and M. Hebert, "Using spin images for efficient object recognition in cluttered 3D scenes," TPAMI, vol. 21, no. 5, 1999.
[3] A. S. Mian, M. Bennamoun, and R. Owens, "3-D model-based object recognition and segmentation in cluttered scenes," TPAMI, vol. 28, no. 10, 2006.

Figure 10. Example of the HSD. Each node shows the discrete probability distribution π_c (unit: 0.2) at that node. Due to limited space, we show only the paths that the test objects belong to, and the paths that share a node with the new paths and have the most training images among the branches. For every existing path, an example training object under the path is displayed in the purple box.

[Table III. Confusion matrix with real range images (unit: %), relating the test objects (mug cup, bottle, paper cup, espresso cup, duck, phone) to the trained classes plus a "New" column. The numeric entries were garbled in transcription.]

[4] B. Drost, M. Ulrich, N. Navab, and S. Ilic, "Model globally, match locally: Efficient and robust 3D object recognition," in CVPR, 2010.
[5] E. Bart, I. Porteous, P. Perona, and M. Welling, "Unsupervised learning of visual taxonomies," in CVPR, 2008.
[6] P. Daras and A. Axenopoulos, "A 3D shape retrieval framework supporting multimodal queries," IJCV, vol. 89, no. 2, 2010.
[7] P. Papadakis, I. Pratikakis, T. Theoharis, and S. Perantonis, "PANORAMA: A 3D shape descriptor based on panoramic views for unsupervised 3D object retrieval," IJCV, vol. 89, no. 2, 2010.
[8] F. Stein and G. Medioni, "Structural indexing: Efficient 3-D object recognition," TPAMI, vol. 14, no. 2, 1992.
[9] S. Ruiz-Correa, L. G. Shapiro, and M. Meila, "A new signature-based method for efficient 3-D object recognition," in CVPR, 2001.
[10] X. Li and I. Guskov, "3D object recognition from range images using pyramid matching," in ICCV, 2007.
[11] D. Huber, A. Kapuria, R. Donamukkala, and M. Hebert, "Parts-based 3D object classification," in CVPR, 2004.
[12] S. Ruiz-Correa, L. G. Shapiro, and M. Meila, "A new paradigm for recognizing 3-D object shapes from range data," in ICCV, 2003.
[13] A. Golovinskiy, V. G. Kim, and T. Funkhouser, "Shape-based recognition of 3D point clouds in urban environments," in ICCV, 2009.
[14] D. Murray and J. J. Little, "Patchlets: Representing stereo vision data with surface elements," in WACV, 2005.
[15] C. Wang, H. Tanahashi, H. Hirayu, Y. Niwa, and K. Yamamoto, "Comparison of local plane fitting methods for range data," in CVPR, 2001.
[16] G. Medioni, M.-S. Lee, and C.-K. Tang, A Computational Framework for Segmentation and Grouping. New York, NY, USA: Elsevier Science Inc., 2000.
[17] A. Y. Ng and M. I. Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes," in NIPS, 2001.
[18] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," JMLR, vol. 3, pp. 993-1022, 2003.
[19] D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum, "Hierarchical topic models and the nested Chinese restaurant process," in NIPS, 2004.
[20] K. R. Canini, L. Shi, and T. L. Griffiths, "Online inference of topics with latent Dirichlet allocation," in AISTATS, 2009.


More information

Factorization with Missing and Noisy Data

Factorization with Missing and Noisy Data Factorization with Missing and Noisy Data Carme Julià, Angel Sappa, Felipe Lumbreras, Joan Serrat, and Antonio López Computer Vision Center and Computer Science Department, Universitat Autònoma de Barcelona,

More information

CS 231A Computer Vision (Fall 2012) Problem Set 3

CS 231A Computer Vision (Fall 2012) Problem Set 3 CS 231A Computer Vision (Fall 2012) Problem Set 3 Due: Nov. 13 th, 2012 (2:15pm) 1 Probabilistic Recursion for Tracking (20 points) In this problem you will derive a method for tracking a point of interest

More information

Unsupervised discovery of category and object models. The task

Unsupervised discovery of category and object models. The task Unsupervised discovery of category and object models Martial Hebert The task 1 Common ingredients 1. Generate candidate segments 2. Estimate similarity between candidate segments 3. Prune resulting (implicit)

More information

Short Survey on Static Hand Gesture Recognition

Short Survey on Static Hand Gesture Recognition Short Survey on Static Hand Gesture Recognition Huu-Hung Huynh University of Science and Technology The University of Danang, Vietnam Duc-Hoang Vo University of Science and Technology The University of

More information

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison*

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison* Tracking Trends: Incorporating Term Volume into Temporal Topic Models Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison* Dept. of Computer Science and Engineering, Lehigh University, Bethlehem, PA,

More information

CSE 252B: Computer Vision II

CSE 252B: Computer Vision II CSE 252B: Computer Vision II Lecturer: Serge Belongie Scribes: Jeremy Pollock and Neil Alldrin LECTURE 14 Robust Feature Matching 14.1. Introduction Last lecture we learned how to find interest points

More information

Features Points. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE)

Features Points. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE) Features Points Andrea Torsello DAIS Università Ca Foscari via Torino 155, 30172 Mestre (VE) Finding Corners Edge detectors perform poorly at corners. Corners provide repeatable points for matching, so

More information

MMM-classification of 3D Range Data

MMM-classification of 3D Range Data MMM-classification of 3D Range Data Anuraag Agrawal, Atsushi Nakazawa, and Haruo Takemura Abstract This paper presents a method for accurately segmenting and classifying 3D range data into particular object

More information

Continuous Multi-Views Tracking using Tensor Voting

Continuous Multi-Views Tracking using Tensor Voting Continuous Multi-Views racking using ensor Voting Jinman Kang, Isaac Cohen and Gerard Medioni Institute for Robotics and Intelligent Systems University of Southern California Los Angeles, CA 90089-073.

More information

Seminar Heidelberg University

Seminar Heidelberg University Seminar Heidelberg University Mobile Human Detection Systems Pedestrian Detection by Stereo Vision on Mobile Robots Philip Mayer Matrikelnummer: 3300646 Motivation Fig.1: Pedestrians Within Bounding Box

More information

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao Classifying Images with Visual/Textual Cues By Steven Kappes and Yan Cao Motivation Image search Building large sets of classified images Robotics Background Object recognition is unsolved Deformable shaped

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

Unsupervised Learning of Visual Taxonomies

Unsupervised Learning of Visual Taxonomies Unsupervised Learning of Visual Taxonomies Evgeniy Bart Caltech Pasadena, CA 91125 bart@caltech.edu Ian Porteous UC Irvine Irvine, CA 92697 iporteou@ics.uci.edu Pietro Perona Caltech Pasadena, CA 91125

More information

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS Cognitive Robotics Original: David G. Lowe, 004 Summary: Coen van Leeuwen, s1460919 Abstract: This article presents a method to extract

More information

Image Segmentation Using Iterated Graph Cuts Based on Multi-scale Smoothing

Image Segmentation Using Iterated Graph Cuts Based on Multi-scale Smoothing Image Segmentation Using Iterated Graph Cuts Based on Multi-scale Smoothing Tomoyuki Nagahashi 1, Hironobu Fujiyoshi 1, and Takeo Kanade 2 1 Dept. of Computer Science, Chubu University. Matsumoto 1200,

More information

Perception. Autonomous Mobile Robots. Sensors Vision Uncertainties, Line extraction from laser scans. Autonomous Systems Lab. Zürich.

Perception. Autonomous Mobile Robots. Sensors Vision Uncertainties, Line extraction from laser scans. Autonomous Systems Lab. Zürich. Autonomous Mobile Robots Localization "Position" Global Map Cognition Environment Model Local Map Path Perception Real World Environment Motion Control Perception Sensors Vision Uncertainties, Line extraction

More information

A System of Image Matching and 3D Reconstruction

A System of Image Matching and 3D Reconstruction A System of Image Matching and 3D Reconstruction CS231A Project Report 1. Introduction Xianfeng Rui Given thousands of unordered images of photos with a variety of scenes in your gallery, you will find

More information

Map-Enhanced UAV Image Sequence Registration and Synchronization of Multiple Image Sequences

Map-Enhanced UAV Image Sequence Registration and Synchronization of Multiple Image Sequences Map-Enhanced UAV Image Sequence Registration and Synchronization of Multiple Image Sequences Yuping Lin and Gérard Medioni Computer Science Department, University of Southern California 941 W. 37th Place,

More information

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University. 3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction

More information

Unsupervised Identification of Multiple Objects of Interest from Multiple Images: discover

Unsupervised Identification of Multiple Objects of Interest from Multiple Images: discover Unsupervised Identification of Multiple Objects of Interest from Multiple Images: discover Devi Parikh and Tsuhan Chen Carnegie Mellon University {dparikh,tsuhan}@cmu.edu Abstract. Given a collection of

More information

STRUCTURAL EDGE LEARNING FOR 3-D RECONSTRUCTION FROM A SINGLE STILL IMAGE. Nan Hu. Stanford University Electrical Engineering

STRUCTURAL EDGE LEARNING FOR 3-D RECONSTRUCTION FROM A SINGLE STILL IMAGE. Nan Hu. Stanford University Electrical Engineering STRUCTURAL EDGE LEARNING FOR 3-D RECONSTRUCTION FROM A SINGLE STILL IMAGE Nan Hu Stanford University Electrical Engineering nanhu@stanford.edu ABSTRACT Learning 3-D scene structure from a single still

More information

One-Shot Learning with a Hierarchical Nonparametric Bayesian Model

One-Shot Learning with a Hierarchical Nonparametric Bayesian Model One-Shot Learning with a Hierarchical Nonparametric Bayesian Model R. Salakhutdinov, J. Tenenbaum and A. Torralba MIT Technical Report, 2010 Presented by Esther Salazar Duke University June 10, 2011 E.

More information

Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image - Supplementary Material -

Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image - Supplementary Material - Uncertainty-Driven 6D Pose Estimation of s and Scenes from a Single RGB Image - Supplementary Material - Eric Brachmann*, Frank Michel, Alexander Krull, Michael Ying Yang, Stefan Gumhold, Carsten Rother

More information

Sketchable Histograms of Oriented Gradients for Object Detection

Sketchable Histograms of Oriented Gradients for Object Detection Sketchable Histograms of Oriented Gradients for Object Detection No Author Given No Institute Given Abstract. In this paper we investigate a new representation approach for visual object recognition. The

More information

Multi-View 3D Object Detection Network for Autonomous Driving

Multi-View 3D Object Detection Network for Autonomous Driving Multi-View 3D Object Detection Network for Autonomous Driving Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, Tian Xia CVPR 2017 (Spotlight) Presented By: Jason Ku Overview Motivation Dataset Network Architecture

More information

The Caltech-UCSD Birds Dataset

The Caltech-UCSD Birds Dataset The Caltech-UCSD Birds-200-2011 Dataset Catherine Wah 1, Steve Branson 1, Peter Welinder 2, Pietro Perona 2, Serge Belongie 1 1 University of California, San Diego 2 California Institute of Technology

More information