Acquiring semantically meaningful models for robotic localization, mapping and target recognition.


Principal Investigator: J. Košecká

Project Abstract

The goal of this proposal is to develop novel representations and techniques for localization, mapping and target recognition from videos of indoor and urban outdoor environments. The proposed techniques would facilitate enhanced navigation capabilities by means of visual sensing and enable scalable, long-term navigation, target detection and tracking in outdoor and indoor environments. The attained representations will also be applicable to human-robot interaction and to the enhancement of human navigational and decision-making capabilities, and will provide compact, semantically meaningful summaries of the acquired sensory experience. The proposed representations will be governed by principles of compositionality, facilitate bottom-up learning, enable efficient inference and can be adapted to the task at hand. The main novelty of the proposed approach is the use of both 3D and 2D geometric and photometric cues, computed either from video sequences or from novel RGB-D cameras, which provide synchronized video and range data at frame rate. Video poses challenges related to more extreme variations in viewpoint and scale, dramatic changes in lighting, and large amounts of clutter and occlusion, but it also enables computation of 3D structure and motion cues, which can aid segmentation and recognition of object and non-object categories. The proposed work can be partitioned into four main research topics:

1. Development of novel matching and alignment strategies for RGB-D video streams, which facilitate robust localization, loop closing and mapping of indoor environments.

2. Development of robust visual mapping and localization algorithms, which will enable semantic labeling of outdoor environments using photometric and geometric cues from video.

3. Development of novel features and representations for weakly supervised and self-supervised strategies for learning models of object and non-object categories from video using 3D geometric and photometric cues.

4. Development of active strategies for model acquisition and target detection exploiting both geometric and photometric cues.

The proposed techniques will be evaluated extensively in the context of unmanned mobile vehicles, where additional sensing and control authority is available, as well as in the context of wearable computing, where the sole goal is that of model building.

Figure 1: Examples of (a) sparse point cloud representation; (b) original panorama and (c) dense piecewise planar reconstruction of an urban street.

1 Statement of Objectives

2 Research Challenges and the State of the Art

While maps comprised of dense or sparse clouds of points are often suitable for accurate localization, they are often insufficient for more advanced planning and reasoning tasks. Several recent efforts for building semantic maps have been considered in robot mapping applications. In the work of [?], semantic labels were associated either with individual locations, such as kitchen, corridor or printer room, or with individual image regions [?]. In the majority of these approaches, the features used to infer the different semantic categories were derived from both 3D range data and photometric cues. In outdoor settings, the final semantic labeling problem has been formulated as a MAP assignment of labels to image regions in a Markov Random Field framework [?]. Example labels include road, building, pedestrian, sky and tree. Additional aspects of the proposed problems studied in isolation include label propagation strategies and the use of a sparse set of 3D features to guide the semantic labeling [?,?,?]. In the work of [?], the authors demonstrated that even the use of simple geometric features can lead to successful semantic segmentation of various semantic categories. The scale of these experiments was rather small, however; they proceeded in a fully supervised setting and have mostly been applied in indoor settings. Alternative strategies for endowing the environment with additional information about rooms use the occupancy grids directly [?]. These approaches discard the original images after building a map, which we believe throws away many useful cues not captured in the map. In previous work [?] we have used the constraints of urban environments to reconstruct, in the multi-view stereo setting, 3D models from challenging imagery captured from a moving vehicle in unconstrained and difficult lighting conditions. The obtained reconstructions correctly captured the piecewise planar structure of the environments, but they were computationally expensive and did not attain more informative semantic labels (see Figure 1). Several authors have recently demonstrated impressive single-view reconstruction systems. The authors of [?] pose the problem as a multi-class segmentation problem, with labels corresponding to 3D geometry, while the authors of [?] reason simultaneously about geometry and object labels.

Figure 2: (a) example environment; (b) most likely labeling (classes are color coded: car, bus, sidewalk, street, building); (c) partitioning of the image into superpixels consistent with occlusion boundaries.

2.1 Related Work on Target Recognition and Semantic Segmentation

The second research theme will strive to associate more detailed semantic labels with individual image regions and video sequences. This problem is referred to as semantic segmentation and deals with the issues of simultaneous segmentation and recognition of different object and non-object categories in an image. An example of the desired semantic segmentation of a street scene is shown in Figure 2. The interplay between the processes of segmentation and recognition has gained increased momentum in the field of computer vision, due to recent advances in fully supervised object recognition strategies as well as improvements in fully unsupervised segmentation algorithms. The majority of the existing approaches tackle object recognition and semantic parsing in a fully or weakly supervised setting, where full pixel-wise segmentations or bounding boxes are used to train discriminative classifiers; the majority of the datasets consider a single background category or contain images with a small number of objects and little clutter (e.g. PASCAL VOC, MSRC21). Approaches which consider object and scene recognition simultaneously have been proposed in [?], where the output of sliding-window object detectors was combined with traditional semantic segmentation strategies. The authors of [?] emphasized the importance of contextual relationships and demonstrated that in the presence of strong contextual cues, even weaker representations of objects are sufficient and not all object categories have to be sought. Additional efforts related to the discovery of geometric semantic labels and contextual cues from a single image have been proposed by [?], demonstrating the role played by geometry as a contextual cue for object detection. An extensive review of the use and the types of contextual relationships can be found in [?]. The existing work on multi-class segmentation typically differs in the choice of elementary regions for which the labels are sought, the types of features used to characterize them, the means of integrating spatial information, and the techniques for learning and inference given the model. In [?,?,?,?] the authors used larger windows or superpixels, characterized by features such as color, texture moments or histograms, dominant orientations and shape, where the likelihoods of the observations are typically obtained in a discriminative setting.
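The superpixel-based pipeline described above can be illustrated with a minimal sketch: compute a per-superpixel descriptor (here a normalized per-channel color histogram; real systems add texture, orientation and shape cues) that a discriminative classifier would then consume. All data and function names here are illustrative, not part of any cited system.

```python
import numpy as np

def superpixel_color_histograms(image, labels, n_bins=4):
    """Compute a color histogram descriptor for every superpixel.

    image:  (H, W, 3) float array with values in [0, 1].
    labels: (H, W) int array assigning each pixel to a superpixel id.
    Returns an (n_superpixels, 3 * n_bins) array: one normalized
    histogram per color channel, concatenated.
    """
    n_sp = labels.max() + 1
    feats = np.zeros((n_sp, 3 * n_bins))
    for sp in range(n_sp):
        mask = labels == sp
        for c in range(3):
            hist, _ = np.histogram(image[..., c][mask],
                                   bins=n_bins, range=(0.0, 1.0))
            feats[sp, c * n_bins:(c + 1) * n_bins] = hist / max(mask.sum(), 1)
    return feats

# Tiny synthetic image: left half blue-ish ("sky"), right half gray ("road"),
# pre-segmented into two superpixels.
H, W = 8, 8
image = np.zeros((H, W, 3))
image[:, :4] = [0.1, 0.3, 0.9]
image[:, 4:] = [0.5, 0.5, 0.5]
labels = np.zeros((H, W), dtype=int)
labels[:, 4:] = 1

feats = superpixel_color_histograms(image, labels)
# Each descriptor sums to 3 (one unit-mass histogram per channel).
assert np.allclose(feats.sum(axis=1), 3.0)
```

Descriptors like these are what the discriminative classifiers mentioned above are trained on; the two regions here produce clearly separable histograms.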

Figure 3: Class-class co-occurrence matrices with example images for the 11-class CamVid and 21-class MSRC21 datasets. Rows and columns of the matrices correspond to class labels and the numbers stand for class-class co-occurrence frequency in the datasets. White stands for zero occurrence. Notice the sparsity of the matrix for the MSRC dataset, meaning that usually only 2-5 objects are present in an image, while in CamVid usually all 11 classes appear simultaneously.

On the other hand, there are existing approaches to object detection and recognition which consider the object recognition problem separately. The representations, datasets and evaluation strategies for these vary. The representations used for object detection and recognition are dominated by successful sliding-window detectors [?,?], which have been used with success for detection and recognition of objects which subtend a rectangular window (e.g. faces, cars, pedestrians). A class of more structured models has been inspired by grammar-based models, which generalize deformable part models by representing objects using variable hierarchical structures. While the foundations of these types of models were laid out some time ago [?,?], in experimental settings they have typically been outperformed by simpler models [?]. Recent advances in building more structured but at the same time effective models using mixtures of multi-scale deformable part models have been proposed in [?]. These models represent objects as collections of parts arranged in a deformable configuration. Each part captures local appearance properties, while deformations are characterized by spring-like connections between certain pairs of parts.
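The class-class co-occurrence statistics shown in Figure 3 can be computed directly from ground-truth label maps. The sketch below uses two toy "images"; the counting convention (number of images in which two classes both occur) is one plausible reading of the figure, stated here as an assumption.

```python
import numpy as np

def class_cooccurrence(label_images, n_classes):
    """Count how often pairs of classes appear together in the same image.

    label_images: iterable of 2D integer label maps (one per image).
    Returns an (n_classes, n_classes) symmetric matrix; entry (a, b) is
    the number of images in which classes a and b both occur.
    """
    M = np.zeros((n_classes, n_classes), dtype=int)
    for lab in label_images:
        present = np.unique(lab)
        for a in present:
            for b in present:
                M[a, b] += 1
    return M

# Two toy label maps: one containing classes {0, 1}, one containing {1, 2}.
img1 = np.array([[0, 0], [1, 1]])
img2 = np.array([[1, 2], [2, 2]])
M = class_cooccurrence([img1, img2], n_classes=3)
# Class 1 appears in both images; classes 0 and 2 never co-occur,
# so their entry stays white (zero) as in the MSRC matrix of Figure 3.
assert M[1, 1] == 2 and M[0, 2] == 0
```

A dense matrix (as for CamVid) signals that most classes appear in every image, which limits how much co-occurrence alone can disambiguate labels.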
2.2 Challenges of Target Recognition and Semantic Segmentation

Semantic segmentation as typically studied in computer vision uses a fully supervised approach for learning models of the desired categories, and the existing approaches are evaluated on commonly used object detection, recognition or semantic segmentation datasets (e.g. PASCAL VOC, Microsoft MSRC21). The images in these datasets mostly contain a small number (2-5) of object/background classes, and there is typically one object in the center which takes up a dominant portion of the image; see Figure 3.

3 Proposed Research

The common theme of the proposed research is to develop novel representations of objects and environments using photometric and geometric cues which will be suited for more robust and scalable localization, target detection and semantic mapping by unmanned mobile and aerial vehicles. The proposed representations could also be utilized for better human-robot interaction and be used

as a basis for the development of tools for parsing and browsing large quantities of videos acquired in urban outdoor and indoor environments.

Figure 4: Pose estimation. (a) Trajectory estimated by our method from images (blue) and by GPS (green), visualized in Google Maps by GPSVisualizer. Our trajectory is put into the world coordinate system by aligning the first two estimated poses in the sequence with the GPS. (b) Three images captured along the trajectory at the places marked in the map on the left.

Next we briefly outline some of our preliminary work and discuss in more detail the proposed directions of research, which tackle the challenges outlined in the previous section.

3.1 Preliminary Work on Localization and Mapping

In our previous work we have developed several approaches for motion recovery and reconstruction of sparse sets of 3D features, as well as dense 3D multi-view reconstruction suitable for recovering geometry in difficult indoor and outdoor environments [?]. For the problem of visual odometry, we employed omnidirectional images and techniques based on matching robust scale-invariant features, followed by computation of the epipolar geometry and batch-based non-linear optimization. The omnidirectional panoramas were composed from four perspective images covering in total 360 deg horizontally and 127 deg vertically. For pose estimation we matched SURF features [?] between each consecutive image pair along the sequence and exploited favorable properties of omnidirectional cameras for motion estimation. The spherical representation of the omnidirectional image allows us to construct corresponding 3D rays p, p' for established tentative point matches u <-> u'. The tentative matches were validated through RANSAC-based epipolar geometry estimation formulated on their 3D rays, i.e. p'^T E p = 0, yielding the essential matrix E = T̂R, where T̂ is the skew-symmetric matrix of the translation [?]. Treating the images as being captured by a central omnidirectional camera was beneficial in many aspects. As the 3D rays are spatially well distributed and cover a large part of the space, the result is a very stable estimate of the essential matrix, as studied in [?]. Moreover, improved convergence of RANSAC was achieved by sampling rays uniformly from each of the four subparts of the panorama. The large field of view (FOV) especially contributed towards better disambiguation of the rotation and translation obtained from the essential matrix. The scale of the translation was estimated by a linear closed-form 1-point algorithm on corresponding 3D points triangulated by DLT [?] from the previous image pair and the current one. This strategy differs from the well known and often used technique [?] in which the full pose of the third camera, i.e. the translation, its scale, and the rotation, is estimated through the 3-point algorithm. Figure 4 shows an example of pose estimation in an outdoor setting, compared to GPS data. Notice that the GPS position can be biased or can contain small jumps, especially when the satellites in one half of the hemisphere are occluded by nearby buildings. Furthermore, GPS itself does not provide rotation information unless equipped with a compass, making visual odometry a suitable complement to GPS.

3.2 Preliminary Work on Semantic Labeling

We have recently completed preliminary work on semantic labeling of street scenes [?], where we explored the use of non-parametric methods based on large visual vocabularies of image features, computed over small superpixels, for the representation and segmentation of object and background categories. We showed that the effort to capture the local spatial context of visual words is of fundamental importance, providing interesting insight into part-based representations.
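Returning to the visual odometry pipeline of Section 3.1, the ray-based epipolar constraint can be sketched with a small synthetic check. This is not the full RANSAC estimator: it only constructs E = T̂R from an assumed relative pose (convention x2 = R x1 + t) and verifies that true ray correspondences satisfy p'^T E p = 0, which is exactly the inlier test a RANSAC loop would apply to tentative matches.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix T̂ such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def epipolar_residuals(E, rays1, rays2):
    """|p'^T E p| for each pair of unit 3D rays (one row per match)."""
    return np.abs(np.einsum('ij,jk,ik->i', rays2, E, rays1))

rng = np.random.default_rng(0)
# Assumed two-view geometry: rotation about z by 0.1 rad, translation t.
angle = 0.1
R = np.array([[np.cos(angle), -np.sin(angle), 0],
              [np.sin(angle),  np.cos(angle), 0],
              [0, 0, 1]])
t = np.array([1.0, 0.2, 0.0])
E = skew(t) @ R                       # essential matrix E = T̂ R

X = rng.uniform(-2, 2, size=(50, 3)) + np.array([0, 0, 5])  # 3D points
rays1 = X / np.linalg.norm(X, axis=1, keepdims=True)         # rays in cam 1
X2 = X @ R.T + t                                             # points in cam 2
rays2 = X2 / np.linalg.norm(X2, axis=1, keepdims=True)       # rays in cam 2

res = epipolar_residuals(E, rays1, rays2)
inliers = res < 1e-9      # RANSAC-style inlier test on the spherical rays
assert inliers.all()      # true matches satisfy the epipolar constraint
```

In the actual pipeline, E is hypothesized from minimal ray samples inside RANSAC and this residual decides inliers; working on unit rays rather than image points is what makes the test valid for the full panoramic field of view.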
These ingredients, along with novel means of integrating spatial relationships, were combined in a probabilistic framework yielding a second-order Markov Random Field (MRF), where the final labeling is obtained as the MAP solution for the labels given an image. The labeling L was estimated as the Maximum A Posteriori (MAP) solution

L* = argmax_L P(L | V) = argmax_L P(V | L) P(L),    (1)

where L are the desired labels and V are the photometric features. The above MAP inference problem can in turn be converted into the following energy minimization problem:

argmin_L ( sum_{i=1}^{S} E_app(l_i) + lambda sum_{(i,j) in E} E_smooth(l_i, l_j) ).    (2)

In our problem domain, we considered 11 labels of object and non-object categories (e.g. sky, road, building, tree, pavement, pedestrian, car, sign, column/pole, etc.) and the experiments were carried out on a dataset of 450 images of street scenes. The key challenge of the above formulation is the choice of the individual terms of the energy function and their spatial interactions. We propose to enhance the existing approaches in several ways.
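A toy version of the energy in Eq. (2) illustrates the interplay of its terms. This sketch makes two simplifying assumptions not taken from the proposal: a Potts penalty stands in for E_smooth, and iterated conditional modes (ICM), a simple greedy scheme, stands in for the actual MAP inference; all data are synthetic.

```python
import numpy as np

def energy(labels, unary, edges, lam=1.0):
    """E(L) = sum_i E_app(l_i) + lam * sum_{(i,j)} [l_i != l_j] (Potts)."""
    e_app = sum(unary[i, labels[i]] for i in range(len(labels)))
    e_smooth = sum(labels[i] != labels[j] for i, j in edges)
    return e_app + lam * e_smooth

def icm(unary, edges, lam=1.0, sweeps=5):
    """Iterated conditional modes: greedy per-site refinement of Eq. (2)."""
    n_sites, n_labels = unary.shape
    labels = unary.argmin(axis=1)          # start from unary-only labeling
    nbrs = {i: [] for i in range(n_sites)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(sweeps):
        for i in range(n_sites):
            costs = unary[i] + lam * np.array(
                [sum(l != labels[j] for j in nbrs[i])
                 for l in range(n_labels)])
            labels[i] = costs.argmin()
    return labels

# Three superpixels in a chain; the middle one has a noisy appearance term
# that, taken alone, would pick the wrong label.
unary = np.array([[0.0, 2.0],
                  [1.0, 0.9],
                  [0.0, 2.0]])
edges = [(0, 1), (1, 2)]
labels = icm(unary, edges, lam=1.0)
assert list(labels) == [0, 0, 0]   # smoothness overrides the noisy site
```

In practice the MAP solution of such energies is computed with graph-cut style solvers rather than ICM; the point here is only how the smoothness term corrects locally ambiguous appearance evidence.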

Figure 5: (a) An example street scene view and (b) a sparse set of 3D points recovered from video of the street scene; even a coarse triangulation reveals 3D structural properties of the scene; (c) coarse semantic segmentation into 11 different object and background categories; (d) preliminary results of our method.

References

[1] Special issue on vision. IEEE Transactions on Robotics.
[2] A. Torralba, K. Murphy, and W. Freeman. Contextual models for object detection using boosted random fields. In Advances in Neural Information Processing Systems 17.
[3] H. Bay, A. Ess, T. Tuytelaars, and L.J. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding (CVIU), 110(3).
[4] A. Berg, F. Grabler, and J. Malik. Parsing images of architectural scenes. In ICCV.
[5] T. Brodsky, C. Fermüller, and Y. Aloimonos. Directions of motion fields are hardly ever ambiguous. IJCV, 1(26):5-24.
[6] G. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88-97.
[7] G. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, pages I:44-57.
[8] P. Buschka and A. Saffiotti. A virtual sensor for room detection. In IEEE Intelligent Robots and Systems.
[9] Canesta.
[10] M. Chandraker, J. Lim, and D. Kriegman. Moving in stereo: Efficient structure and motion using lines. In ICCV.
[11] M. Cummins and P. Newman. Highly scalable appearance-only SLAM - FAB-MAP 2.0. In RSS.
[12] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[13] J. Yuen et al. LabelMe video. In ICCV.
[14] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15] C. Galleguillos and S. Belongie. Context based object categorization: A critical survey. Computer Vision and Image Understanding (CVIU), 114.
[16] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In CVPR.
[17] S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection. In NIPS.
[18] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition.
[19] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In ISER.
[20] D. Hoiem, A. Efros, and M. Hebert. Geometric context from a single image. In ICCV.
[21] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1):2-23.
[22] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In IEEE Int. Symp. on Mixed and Augmented Reality.
[23] K. Konolige. Projected texture stereo. In IEEE International Conference on Robotics and Automation.
[24] K. Konolige, M. Calonder, J. Bowman, P. Michelich, J.D. Chen, P. Fua, and V. Lepetit. View-based maps. In RSS.
[25] J. Kosecka and W. Zhang. Video compass. In ECCV.
[26] M. Bosse, R. Rikoski, J. Leonard, and S. Teller. Omnidirectional structure from motion using vanishing points and 3D lines. In Visual Computer, volume 19.
[27] B. Mičušík and J. Košecká. Multi-view superpixel stereo in man-made environments. International Journal of Computer Vision.
[28] B. Mičušík and J. Košecká. Piecewise planar city modeling from street view panoramic sequences. In CVPR.
[29] B. Mičušík and J. Košecká. Semantic segmentation of street scenes by superpixel co-occurrence and 3D geometry. In IEEE Workshop on Video-Oriented Object and Event Classification.
[30] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition.
[31] D. Nistér. Preemptive RANSAC for live structure and motion estimation. Machine Vision and Applications (MVA), 16(5).
[32] P. Kohli, L. Ladicky, and P. Torr. Robust higher order potentials for enforcing label consistency. In CVPR.
[33] P. Smith, I. Reid, and A.J. Davison. Real-time monocular SLAM with straight lines.
[34] M. Posner and P. Newman. Learning hierarchical models of scenes, objects, and parts. In Robotics and Autonomous Systems, 56(11).

[35] PrimeSense.
[36] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV.
[37] B. Russell, A. Torralba, C. Liu, and R. Fergus. Object recognition by scene alignment. In NIPS.
[38] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-class segmentation with relative location prior. IJCV, 80(3).
[39] S. Savarese, J. Winn, and A. Criminisi. Discriminative object class models of appearance and shape by correlatons. In CVPR.
[40] S. Se, D. Lowe, and J. Little. Mobile robot localization and mapping with uncertainty using scale-invariant features. International Journal of Robotics Research.
[41] G. Singh and J. Kosecka. Visual loop closing using gist descriptors in Manhattan world. In Proc. of IEEE Int. Conf. on Robotics and Automation, Workshop on Omnidirectional Vision.
[42] C. Stachniss, O. Martínez-Mozos, A. Rottmann, and W. Burgard. Semantic labelling of places. In International Symposium on Robotics Research.
[43] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In ICCV.
[44] M. Tomono. Robust 3D SLAM with a stereo camera, based on an edge-point ICP algorithm. In IEEE Int. Conference on Robotics and Automation.
[45] A. Vazquez-Reina, S. Avidan, H. Pfister, and E. Miller. Multiple hypothesis video segmentation from superpixel flows. In ECCV.
[46] P. Viola and M. Jones. Robust real-time object detection. IJCV, 2(57).
[47] J. Xiao and L. Quan. Multiple view semantic segmentation for street view images. In ICCV.
[48] Y. Yin and S. Geman. Context and hierarchy in a probabilistic image model. In CVPR.
[49] S. Zhu and D. Mumford. A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2(4).


More information

Closing the Loop in Scene Interpretation

Closing the Loop in Scene Interpretation Closing the Loop in Scene Interpretation Derek Hoiem Beckman Institute University of Illinois dhoiem@uiuc.edu Alexei A. Efros Robotics Institute Carnegie Mellon University efros@cs.cmu.edu Martial Hebert

More information

Deformable Part Models

Deformable Part Models CS 1674: Intro to Computer Vision Deformable Part Models Prof. Adriana Kovashka University of Pittsburgh November 9, 2016 Today: Object category detection Window-based approaches: Last time: Viola-Jones

More information

Visual Object Recognition

Visual Object Recognition Perceptual and Sensory Augmented Computing Visual Object Recognition Tutorial Visual Object Recognition Bastian Leibe Computer Vision Laboratory ETH Zurich Chicago, 14.07.2008 & Kristen Grauman Department

More information

Contexts and 3D Scenes

Contexts and 3D Scenes Contexts and 3D Scenes Computer Vision Jia-Bin Huang, Virginia Tech Many slides from D. Hoiem Administrative stuffs Final project presentation Dec 1 st 3:30 PM 4:45 PM Goodwin Hall Atrium Grading Three

More information

Multi-view Stereo. Ivo Boyadzhiev CS7670: September 13, 2011

Multi-view Stereo. Ivo Boyadzhiev CS7670: September 13, 2011 Multi-view Stereo Ivo Boyadzhiev CS7670: September 13, 2011 What is stereo vision? Generic problem formulation: given several images of the same object or scene, compute a representation of its 3D shape

More information

3D layout propagation to improve object recognition in egocentric videos

3D layout propagation to improve object recognition in egocentric videos 3D layout propagation to improve object recognition in egocentric videos Alejandro Rituerto, Ana C. Murillo and José J. Guerrero {arituerto,acm,josechu.guerrero}@unizar.es Instituto de Investigación en

More information

Local features and image matching. Prof. Xin Yang HUST

Local features and image matching. Prof. Xin Yang HUST Local features and image matching Prof. Xin Yang HUST Last time RANSAC for robust geometric transformation estimation Translation, Affine, Homography Image warping Given a 2D transformation T and a source

More information

Semantic Segmentation of Street-Side Images

Semantic Segmentation of Street-Side Images Semantic Segmentation of Street-Side Images Michal Recky 1, Franz Leberl 2 1 Institute for Computer Graphics and Vision Graz University of Technology recky@icg.tugraz.at 2 Institute for Computer Graphics

More information

Structured Models in. Dan Huttenlocher. June 2010

Structured Models in. Dan Huttenlocher. June 2010 Structured Models in Computer Vision i Dan Huttenlocher June 2010 Structured Models Problems where output variables are mutually dependent or constrained E.g., spatial or temporal relations Such dependencies

More information

Supervised learning. y = f(x) function

Supervised learning. y = f(x) function Supervised learning y = f(x) output prediction function Image feature Training: given a training set of labeled examples {(x 1,y 1 ),, (x N,y N )}, estimate the prediction function f by minimizing the

More information

Can Similar Scenes help Surface Layout Estimation?

Can Similar Scenes help Surface Layout Estimation? Can Similar Scenes help Surface Layout Estimation? Santosh K. Divvala, Alexei A. Efros, Martial Hebert Robotics Institute, Carnegie Mellon University. {santosh,efros,hebert}@cs.cmu.edu Abstract We describe

More information

Geometric Context from Videos

Geometric Context from Videos 2013 IEEE Conference on Computer Vision and Pattern Recognition Geometric Context from Videos S. Hussain Raza Matthias Grundmann Irfan Essa Georgia Institute of Technology, Atlanta, GA, USA http://www.cc.gatech.edu/cpl/projects/videogeometriccontext

More information

Multi-Class Image Labeling with Top-Down Segmentation and Generalized Robust P N Potentials

Multi-Class Image Labeling with Top-Down Segmentation and Generalized Robust P N Potentials FLOROS ET AL.: MULTI-CLASS IMAGE LABELING WITH TOP-DOWN SEGMENTATIONS 1 Multi-Class Image Labeling with Top-Down Segmentation and Generalized Robust P N Potentials Georgios Floros 1 floros@umic.rwth-aachen.de

More information

CS 558: Computer Vision 13 th Set of Notes

CS 558: Computer Vision 13 th Set of Notes CS 558: Computer Vision 13 th Set of Notes Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Office: Lieb 215 Overview Context and Spatial Layout

More information

Three-Dimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients

Three-Dimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients ThreeDimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients Authors: Zhile Ren, Erik B. Sudderth Presented by: Shannon Kao, Max Wang October 19, 2016 Introduction Given an

More information

Nonparametric Semantic Segmentation for 3D Street Scenes

Nonparametric Semantic Segmentation for 3D Street Scenes 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) November 3-7, 2013. Tokyo, Japan Nonparametric Semantic Segmentation for 3D Street Scenes Hu He and Ben Upcroft Abstract

More information

Instance-level recognition part 2

Instance-level recognition part 2 Visual Recognition and Machine Learning Summer School Paris 2011 Instance-level recognition part 2 Josef Sivic http://www.di.ens.fr/~josef INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548 Laboratoire d Informatique,

More information

Decomposing a Scene into Geometric and Semantically Consistent Regions

Decomposing a Scene into Geometric and Semantically Consistent Regions Decomposing a Scene into Geometric and Semantically Consistent Regions Stephen Gould sgould@stanford.edu Richard Fulton rafulton@cs.stanford.edu Daphne Koller koller@cs.stanford.edu IEEE International

More information

IDE-3D: Predicting Indoor Depth Utilizing Geometric and Monocular Cues

IDE-3D: Predicting Indoor Depth Utilizing Geometric and Monocular Cues 2016 International Conference on Computational Science and Computational Intelligence IDE-3D: Predicting Indoor Depth Utilizing Geometric and Monocular Cues Taylor Ripke Department of Computer Science

More information

Methods for Representing and Recognizing 3D objects

Methods for Representing and Recognizing 3D objects Methods for Representing and Recognizing 3D objects part 1 Silvio Savarese University of Michigan at Ann Arbor Object: Building, 45º pose, 8-10 meters away Object: Person, back; 1-2 meters away Object:

More information

Local Features and Bag of Words Models

Local Features and Bag of Words Models 10/14/11 Local Features and Bag of Words Models Computer Vision CS 143, Brown James Hays Slides from Svetlana Lazebnik, Derek Hoiem, Antonio Torralba, David Lowe, Fei Fei Li and others Computer Engineering

More information

Automatic Photo Popup

Automatic Photo Popup Automatic Photo Popup Derek Hoiem Alexei A. Efros Martial Hebert Carnegie Mellon University What Is Automatic Photo Popup Introduction Creating 3D models from images is a complex process Time-consuming

More information

Semantic Parsing of Street Scene Images Using 3D LiDAR Point Cloud

Semantic Parsing of Street Scene Images Using 3D LiDAR Point Cloud 2013 IEEE International Conference on Computer Vision Workshops Semantic Parsing of Street Scene Images Using 3D LiDAR Point Cloud Pouria Babahajiani Tampere University of Technology Tampere, Finland pouria.babahajiani@tut.fi

More information

Selection of Scale-Invariant Parts for Object Class Recognition

Selection of Scale-Invariant Parts for Object Class Recognition Selection of Scale-Invariant Parts for Object Class Recognition Gy. Dorkó and C. Schmid INRIA Rhône-Alpes, GRAVIR-CNRS 655, av. de l Europe, 3833 Montbonnot, France fdorko,schmidg@inrialpes.fr Abstract

More information

Scale and Rotation Invariant Color Features for Weakly-Supervised Object Learning in 3D Space

Scale and Rotation Invariant Color Features for Weakly-Supervised Object Learning in 3D Space Scale and Rotation Invariant Color Features for Weakly-Supervised Object Learning in 3D Space Asako Kanezaki Tatsuya Harada Yasuo Kuniyoshi Graduate School of Information Science and Technology, The University

More information

Nonrigid Surface Modelling. and Fast Recovery. Department of Computer Science and Engineering. Committee: Prof. Leo J. Jia and Prof. K. H.

Nonrigid Surface Modelling. and Fast Recovery. Department of Computer Science and Engineering. Committee: Prof. Leo J. Jia and Prof. K. H. Nonrigid Surface Modelling and Fast Recovery Zhu Jianke Supervisor: Prof. Michael R. Lyu Committee: Prof. Leo J. Jia and Prof. K. H. Wong Department of Computer Science and Engineering May 11, 2007 1 2

More information

Probabilistic Location Recognition using Reduced Feature Set

Probabilistic Location Recognition using Reduced Feature Set Probabilistic Location Recognition using Reduced Feature Set Fayin Li and Jana Košecá Department of Computer Science George Mason University, Fairfax, VA 3 Email: {fli,oseca}@cs.gmu.edu Abstract The localization

More information

AR Cultural Heritage Reconstruction Based on Feature Landmark Database Constructed by Using Omnidirectional Range Sensor

AR Cultural Heritage Reconstruction Based on Feature Landmark Database Constructed by Using Omnidirectional Range Sensor AR Cultural Heritage Reconstruction Based on Feature Landmark Database Constructed by Using Omnidirectional Range Sensor Takafumi Taketomi, Tomokazu Sato, and Naokazu Yokoya Graduate School of Information

More information

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009

Analysis: TextonBoost and Semantic Texton Forests. Daniel Munoz Februrary 9, 2009 Analysis: TextonBoost and Semantic Texton Forests Daniel Munoz 16-721 Februrary 9, 2009 Papers [shotton-eccv-06] J. Shotton, J. Winn, C. Rother, A. Criminisi, TextonBoost: Joint Appearance, Shape and Context

More information

Category-level localization

Category-level localization Category-level localization Cordelia Schmid Recognition Classification Object present/absent in an image Often presence of a significant amount of background clutter Localization / Detection Localize object

More information

Announcements. Recognition. Recognition. Recognition. Recognition. Homework 3 is due May 18, 11:59 PM Reading: Computer Vision I CSE 152 Lecture 14

Announcements. Recognition. Recognition. Recognition. Recognition. Homework 3 is due May 18, 11:59 PM Reading: Computer Vision I CSE 152 Lecture 14 Announcements Computer Vision I CSE 152 Lecture 14 Homework 3 is due May 18, 11:59 PM Reading: Chapter 15: Learning to Classify Chapter 16: Classifying Images Chapter 17: Detecting Objects in Images Given

More information

SuperParsing: Scalable Nonparametric Image Parsing with Superpixels

SuperParsing: Scalable Nonparametric Image Parsing with Superpixels SuperParsing: Scalable Nonparametric Image Parsing with Superpixels Joseph Tighe and Svetlana Lazebnik Dept. of Computer Science, University of North Carolina at Chapel Hill Chapel Hill, NC 27599-3175

More information

EE290T : 3D Reconstruction and Recognition

EE290T : 3D Reconstruction and Recognition EE290T : 3D Reconstruction and Recognition Acknowledgement Courtesy of Prof. Silvio Savarese. Introduction There was a table set out under a tree in front of the house, and the March Hare and the Hatter

More information

EECS 442 Computer vision. 3D Object Recognition and. Scene Understanding

EECS 442 Computer vision. 3D Object Recognition and. Scene Understanding EECS 442 Computer vision 3D Object Recognition and Scene Understanding Interpreting the visual world Object: Building 8-10 meters away Object: Traffic light Object: Car, ¾ view 2-3 meters away How can

More information

Simultaneous Multi-class Pixel Labeling over Coherent Image Sets

Simultaneous Multi-class Pixel Labeling over Coherent Image Sets Simultaneous Multi-class Pixel Labeling over Coherent Image Sets Paul Rivera Research School of Computer Science Australian National University Canberra, ACT 0200 Stephen Gould Research School of Computer

More information

Using the Forest to See the Trees: Context-based Object Recognition

Using the Forest to See the Trees: Context-based Object Recognition Using the Forest to See the Trees: Context-based Object Recognition Bill Freeman Joint work with Antonio Torralba and Kevin Murphy Computer Science and Artificial Intelligence Laboratory MIT A computer

More information

Flow Estimation. Min Bai. February 8, University of Toronto. Min Bai (UofT) Flow Estimation February 8, / 47

Flow Estimation. Min Bai. February 8, University of Toronto. Min Bai (UofT) Flow Estimation February 8, / 47 Flow Estimation Min Bai University of Toronto February 8, 2016 Min Bai (UofT) Flow Estimation February 8, 2016 1 / 47 Outline Optical Flow - Continued Min Bai (UofT) Flow Estimation February 8, 2016 2

More information

Joint Inference in Image Databases via Dense Correspondence. Michael Rubinstein MIT CSAIL (while interning at Microsoft Research)

Joint Inference in Image Databases via Dense Correspondence. Michael Rubinstein MIT CSAIL (while interning at Microsoft Research) Joint Inference in Image Databases via Dense Correspondence Michael Rubinstein MIT CSAIL (while interning at Microsoft Research) My work Throughout the year (and my PhD thesis): Temporal Video Analysis

More information

Beyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba

Beyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba Beyond bags of features: Adding spatial information Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba Adding spatial information Forming vocabularies from pairs of nearby features doublets

More information

Removing Moving Objects from Point Cloud Scenes

Removing Moving Objects from Point Cloud Scenes Removing Moving Objects from Point Cloud Scenes Krystof Litomisky and Bir Bhanu University of California, Riverside krystof@litomisky.com, bhanu@ee.ucr.edu Abstract. Three-dimensional simultaneous localization

More information

Revisiting 3D Geometric Models for Accurate Object Shape and Pose

Revisiting 3D Geometric Models for Accurate Object Shape and Pose Revisiting 3D Geometric Models for Accurate Object Shape and Pose M. 1 Michael Stark 2,3 Bernt Schiele 3 Konrad Schindler 1 1 Photogrammetry and Remote Sensing Laboratory Swiss Federal Institute of Technology

More information

Visual Recognition and Search April 18, 2008 Joo Hyun Kim

Visual Recognition and Search April 18, 2008 Joo Hyun Kim Visual Recognition and Search April 18, 2008 Joo Hyun Kim Introduction Suppose a stranger in downtown with a tour guide book?? Austin, TX 2 Introduction Look at guide What s this? Found Name of place Where

More information

Line Image Signature for Scene Understanding with a Wearable Vision System

Line Image Signature for Scene Understanding with a Wearable Vision System Line Image Signature for Scene Understanding with a Wearable Vision System Alejandro Rituerto DIIS - I3A, University of Zaragoza, Spain arituerto@unizar.es Ana C. Murillo DIIS - I3A, University of Zaragoza,

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

Correcting User Guided Image Segmentation

Correcting User Guided Image Segmentation Correcting User Guided Image Segmentation Garrett Bernstein (gsb29) Karen Ho (ksh33) Advanced Machine Learning: CS 6780 Abstract We tackle the problem of segmenting an image into planes given user input.

More information

Part based models for recognition. Kristen Grauman

Part based models for recognition. Kristen Grauman Part based models for recognition Kristen Grauman UT Austin Limitations of window-based models Not all objects are box-shaped Assuming specific 2d view of object Local components themselves do not necessarily

More information

Object and Class Recognition I:

Object and Class Recognition I: Object and Class Recognition I: Object Recognition Lectures 10 Sources ICCV 2005 short courses Li Fei-Fei (UIUC), Rob Fergus (Oxford-MIT), Antonio Torralba (MIT) http://people.csail.mit.edu/torralba/iccv2005

More information

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011 Previously Part-based and local feature models for generic object recognition Wed, April 20 UT-Austin Discriminative classifiers Boosting Nearest neighbors Support vector machines Useful for object recognition

More information

arxiv: v1 [cs.cv] 28 Sep 2018

arxiv: v1 [cs.cv] 28 Sep 2018 Extrinsic camera calibration method and its performance evaluation Jacek Komorowski 1 and Przemyslaw Rokita 2 arxiv:1809.11073v1 [cs.cv] 28 Sep 2018 1 Maria Curie Sklodowska University Lublin, Poland jacek.komorowski@gmail.com

More information

Optimizing Monocular Cues for Depth Estimation from Indoor Images

Optimizing Monocular Cues for Depth Estimation from Indoor Images Optimizing Monocular Cues for Depth Estimation from Indoor Images Aditya Venkatraman 1, Sheetal Mahadik 2 1, 2 Department of Electronics and Telecommunication, ST Francis Institute of Technology, Mumbai,

More information

DEPTH AND GEOMETRY FROM A SINGLE 2D IMAGE USING TRIANGULATION

DEPTH AND GEOMETRY FROM A SINGLE 2D IMAGE USING TRIANGULATION 2012 IEEE International Conference on Multimedia and Expo Workshops DEPTH AND GEOMETRY FROM A SINGLE 2D IMAGE USING TRIANGULATION Yasir Salih and Aamir S. Malik, Senior Member IEEE Centre for Intelligent

More information

Instance-level recognition II.

Instance-level recognition II. Reconnaissance d objets et vision artificielle 2010 Instance-level recognition II. Josef Sivic http://www.di.ens.fr/~josef INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548 Laboratoire d Informatique, Ecole Normale

More information

High Level Computer Vision

High Level Computer Vision High Level Computer Vision Part-Based Models for Object Class Recognition Part 2 Bernt Schiele - schiele@mpi-inf.mpg.de Mario Fritz - mfritz@mpi-inf.mpg.de http://www.d2.mpi-inf.mpg.de/cv Please Note No

More information

Spatially Constrained Location Prior for Scene Parsing

Spatially Constrained Location Prior for Scene Parsing Spatially Constrained Location Prior for Scene Parsing Ligang Zhang, Brijesh Verma, David Stockwell, Sujan Chowdhury Centre for Intelligent Systems School of Engineering and Technology, Central Queensland

More information

Orientation-Aware Scene Understanding for Mobile Cameras

Orientation-Aware Scene Understanding for Mobile Cameras Orientation-Aware Scene Understanding for Mobile Cameras Jing Wang Georgia Inst. of Technology Atlanta, Georgia, USA jwang302@gatech.edu Grant Schindler Georgia Inst. of Technology Atlanta, Georgia, USA

More information

LEARNING BOUNDARIES WITH COLOR AND DEPTH. Zhaoyin Jia, Andrew Gallagher, Tsuhan Chen

LEARNING BOUNDARIES WITH COLOR AND DEPTH. Zhaoyin Jia, Andrew Gallagher, Tsuhan Chen LEARNING BOUNDARIES WITH COLOR AND DEPTH Zhaoyin Jia, Andrew Gallagher, Tsuhan Chen School of Electrical and Computer Engineering, Cornell University ABSTRACT To enable high-level understanding of a scene,

More information

Object Recognition. Computer Vision. Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce

Object Recognition. Computer Vision. Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce Object Recognition Computer Vision Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce How many visual object categories are there? Biederman 1987 ANIMALS PLANTS OBJECTS

More information

CEng Computational Vision

CEng Computational Vision CEng 583 - Computational Vision 2011-2012 Spring Week 4 18 th of March, 2011 Today 3D Vision Binocular (Multi-view) cues: Stereopsis Motion Monocular cues Shading Texture Familiar size etc. "God must

More information

Lecture 12 Recognition

Lecture 12 Recognition Institute of Informatics Institute of Neuroinformatics Lecture 12 Recognition Davide Scaramuzza 1 Lab exercise today replaced by Deep Learning Tutorial Room ETH HG E 1.1 from 13:15 to 15:00 Optional lab

More information

Jakob Engel, Thomas Schöps, Daniel Cremers Technical University Munich. LSD-SLAM: Large-Scale Direct Monocular SLAM

Jakob Engel, Thomas Schöps, Daniel Cremers Technical University Munich. LSD-SLAM: Large-Scale Direct Monocular SLAM Computer Vision Group Technical University of Munich Jakob Engel LSD-SLAM: Large-Scale Direct Monocular SLAM Jakob Engel, Thomas Schöps, Daniel Cremers Technical University Munich Monocular Video Engel,

More information

Object Detection with Partial Occlusion Based on a Deformable Parts-Based Model

Object Detection with Partial Occlusion Based on a Deformable Parts-Based Model Object Detection with Partial Occlusion Based on a Deformable Parts-Based Model Johnson Hsieh (johnsonhsieh@gmail.com), Alexander Chia (alexchia@stanford.edu) Abstract -- Object occlusion presents a major

More information

arxiv: v1 [cs.cv] 1 Aug 2017

arxiv: v1 [cs.cv] 1 Aug 2017 Dense Piecewise Planar RGB-D SLAM for Indoor Environments Phi-Hung Le and Jana Kosecka arxiv:1708.00514v1 [cs.cv] 1 Aug 2017 Abstract The paper exploits weak Manhattan constraints to parse the structure

More information

Constructing Implicit 3D Shape Models for Pose Estimation

Constructing Implicit 3D Shape Models for Pose Estimation Constructing Implicit 3D Shape Models for Pose Estimation Mica Arie-Nachimson Ronen Basri Dept. of Computer Science and Applied Math. Weizmann Institute of Science Rehovot 76100, Israel Abstract We present

More information

Robot Localization based on Geo-referenced Images and G raphic Methods

Robot Localization based on Geo-referenced Images and G raphic Methods Robot Localization based on Geo-referenced Images and G raphic Methods Sid Ahmed Berrabah Mechanical Department, Royal Military School, Belgium, sidahmed.berrabah@rma.ac.be Janusz Bedkowski, Łukasz Lubasiński,

More information

Semantic Video Segmentation From Occlusion Relations Within a Convex Optimization Framework

Semantic Video Segmentation From Occlusion Relations Within a Convex Optimization Framework Semantic Video Segmentation From Occlusion Relations Within a Convex Optimization Framework Brian Taylor, Alper Ayvaci, Avinash Ravichandran, and Stefano Soatto University of California, Los Angeles Honda

More information