Three-Dimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients. Authors: Zhile Ren, Erik B. Sudderth. Presented by: Shannon Kao, Max Wang. October 19, 2016
Introduction: Given an image of a realistic indoor scene, how do you classify the objects in it?
Goals. Indoor scene understanding: to develop new representations and algorithms for 3D object detection and spatial layout prediction in cluttered indoor scenes. Main challenge: images of indoor (home or office) environments are typically highly cluttered and have substantial occlusions.
Previous Work: CAD Detection. CAD models are used to learn object shapes and alternative viewpoints. However, there is no large supply of models; the models do not cover different classes of the same object type; and the approach is computationally inefficient. References: M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic. Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. In CVPR, 2014. J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing IKEA objects: Fine pose estimation. In ICCV, 2013. J. J. Lim, A. Khosla, and A. Torralba. FPM: Fine pose parts-based model with 3D CAD models. In ECCV, pages 478-493. Springer, 2014.
Previous Work: Layout Proposal. Manhattan structure is used to infer 2D projections of the 3D structure. An integral representation is used to explore exponentially many layout proposals. Previous work focused on restricted environments and does not generalize to cluttered scenes. Manual heuristics are needed to reduce scene-parsing false positives. Scalability is an issue. References: J. M. Coughlan and A. L. Yuille. Manhattan world: Compass direction from a single image by Bayesian inference. In ICCV, volume 2, pages 941-947. IEEE, 1999. Z. Wu, S. Song, A. Khosla, X. Tang, and J. Xiao. 3D ShapeNets for 2.5D object recognition and next-best-view prediction. arXiv preprint arXiv:1406.5670, 2014. S. Song and J. Xiao. Sliding shapes for 3D object detection in depth images. In ECCV, pages 634-651. Springer, 2014.
Proposed Solution. Representation: geometric features; clouds of oriented gradients (COG); novel Manhattan voxel structure (layout). Training: structured SVM (on cuboid + layout); cascaded classification framework.
RGB-D to Voxel Features
Geometric Features. Point cloud density: 3D density feature $\phi^a_{i\ell} = N_{i\ell} / A_{i\ell}$. 3D normal orientations: find the normal orientation for each 3D point using a plane fit to its 15 nearest neighbors.
Clouds of Oriented Gradients (COG). Computes gradients on the RGB channels of the 2D image. Applies gradient filters; the maximum responses across color channels are the gradients (dx, dy) in the x and y directions, with magnitude $\sqrt{dx^2 + dy^2}$.
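The per-pixel gradient step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the centered-difference filter [-1, 0, 1], the choice of the channel with the largest magnitude, and the function name `cog_pixel_gradient` are all assumptions of this sketch.

```python
import math

def cog_pixel_gradient(image, x, y):
    """Return (dx, dy, magnitude) at interior pixel (x, y).

    image[y][x] is an (r, g, b) tuple. The centered-difference filter
    [-1, 0, 1] used here is an assumption of this sketch.
    """
    best = (0.0, 0.0, 0.0)
    for c in range(3):
        # Horizontal and vertical centered differences for channel c.
        dx = image[y][x + 1][c] - image[y][x - 1][c]
        dy = image[y + 1][x][c] - image[y - 1][x][c]
        mag = math.hypot(dx, dy)
        # Keep the response of the channel with the largest magnitude.
        if mag > best[2]:
            best = (float(dx), float(dy), mag)
    return best
```

The function only handles interior pixels; border handling is left out to keep the sketch short.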
Clouds of Oriented Gradients (COG). Uses nine 3D orientation bins from 0 to 180 degrees. Uses perspective projection to find the corresponding 2D bin boundaries.
COG Normalization and Aliasing. Bilinearly interpolates gradient magnitudes between neighboring orientation bins. Normalizes the histogram, for a small ϵ > 0. Dimension of COG: $6^3 \times 9 = 1944$.
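A rough sketch of the soft binning and normalization follows. It simplifies the method in two labeled ways: the nine bins are taken to be uniform 20-degree bins in the image plane (COG projects 3D bin boundaries into 2D, so the real 2D bins need not be uniform), and the normalization is assumed to be a HOG-style L2 form. The names `soft_bin` and `normalize` are hypothetical.

```python
import math

NUM_BINS = 9       # orientation bins over [0, 180) degrees
VOXELS = 6 ** 3    # 6 x 6 x 6 voxel grid per cuboid

def soft_bin(angle_deg, magnitude, num_bins=NUM_BINS):
    """Split a gradient magnitude between the two nearest orientation bins.

    Assumes uniform bins; bin k is centered at (k + 0.5) * bin width.
    """
    width = 180.0 / num_bins
    pos = (angle_deg % 180.0) / width - 0.5
    lower = math.floor(pos)
    frac = pos - lower
    hist = [0.0] * num_bins
    # Orientation is unsigned, so the first and last bins are neighbors.
    hist[lower % num_bins] += magnitude * (1.0 - frac)
    hist[(lower + 1) % num_bins] += magnitude * frac
    return hist

def normalize(hist, eps=1e-6):
    """Assumed L2 normalization: h / sqrt(||h||^2 + eps)."""
    norm = math.sqrt(sum(h * h for h in hist) + eps)
    return [h / norm for h in hist]
```

With one 9-bin histogram per voxel, the descriptor length is `VOXELS * NUM_BINS`, which matches the 1944 dimensions stated above.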
Room Layout Geometry: Manhattan Voxels. Room layout prediction: floor, ceiling, walls. Discretize the vertical space between floor and ceiling into 6 equal bins. Use a threshold of 0.15 m to separate points near walls from the hypothesized layout. Use diagonal lines to split bins at room corners, creating 12 x 6 = 72 bins.
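The binning described above might be sketched as below. The vertical binning follows the slide directly; the treatment of the 0.15 m threshold as a symmetric band around the wall, the signed-distance convention, and the names `vertical_bin` and `wall_zone` are assumptions of this sketch.

```python
def vertical_bin(z, floor_z, ceiling_z, num_bins=6):
    """Index (0..num_bins-1) of the equal-height bin containing height z."""
    t = (z - floor_z) / (ceiling_z - floor_z)
    # Clamp so points at or beyond the ceiling or floor fall in the end bins.
    return min(max(int(t * num_bins), 0), num_bins - 1)

def wall_zone(signed_dist, threshold=0.15):
    """Classify a point by signed distance (meters) to the nearest wall.

    Positive distances are inside the hypothesized room (assumed convention).
    """
    if signed_dist > threshold:
        return "interior"
    if signed_dist >= -threshold:
        return "near_wall"
    return "outside"

# 12 horizontal regions times 6 vertical bins.
TOTAL_BINS = 12 * 6
```

The diagonal split at room corners is not modeled here; the sketch only covers the height bins and the wall-distance test.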
Manhattan voxels cont. Regions 1-4: scene interior, where objects could be placed anywhere (point cloud distribution varies widely). Regions 5-8: model points near the assumed Manhattan wall structure; here, 5 and 6 contain orthogonal planes. Regions 9-12: points outside of the predicted layout.
Proposed Solution. Representation: geometric features; clouds of oriented gradients (COG); novel Manhattan voxel structure. Training: structured SVMs (on cuboid + layout); cascaded classification framework.
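Cuboid Detection. Features for cuboid $i$: $\phi(I_i, B_i) = \{\phi^a_{i\ell}, \phi^b_{i\ell}, \phi^c_{i\ell}\}_{\ell=1}^{216}$, where $\phi^a_{i\ell}$ is the point cloud density feature, $\phi^b_{i\ell}$ the 25 surface normal histogram features, and $\phi^c_{i\ell}$ the 9 COG features. Find a prediction function $h_c : \mathcal{I} \to \mathcal{B}$, with $B = (L, \theta, S)$: $L$ is the center of the cuboid in 3D, $\theta$ the cuboid orientation, and $S$ the physical size of the cuboid.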
Cuboid Detection. Training: n-slack formulation of structural SVM, where $n$ is the number of categories, $I_i$ the input image for cuboid $i$, $B_i$ the bounding box for cuboid $i$, and $C$ a constant.
Cuboid Detection. Loss function, defined in terms of: $B$, the 3D bounding box; $\bar{B}$, the ground truth bounding box; $\theta$, the orientation with respect to the ground; $\bar{\theta}$, the ground truth orientation.
Cuboid Hypothesis. Cuboid hypotheses are calculated using a sliding window. Width quantiles: {0.1, 0.3, 0.5, 0.7, 0.9}. Depth quantiles: {0.25, 0.5, 0.75}. Height quantiles: {0.3, 0.5, 0.8}. All combinations of voxel size, 3D location, and orientation (from 16 candidate orientations) are evaluated.
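To get a feel for the size of the search space, the size and orientation combinations can be enumerated as below. This is only an illustration: the even spacing of the 16 orientations over a full turn and the name `size_orientation_grid` are assumptions, and the sliding over 3D locations is omitted.

```python
import math
from itertools import product

WIDTH_Q = (0.1, 0.3, 0.5, 0.7, 0.9)
DEPTH_Q = (0.25, 0.5, 0.75)
HEIGHT_Q = (0.3, 0.5, 0.8)
NUM_ORIENTATIONS = 16

def size_orientation_grid():
    """All (width_q, depth_q, height_q, angle) combinations per 3D location."""
    # Evenly spaced candidate angles in radians (assumed spacing).
    angles = [2.0 * math.pi * k / NUM_ORIENTATIONS
              for k in range(NUM_ORIENTATIONS)]
    return list(product(WIDTH_Q, DEPTH_Q, HEIGHT_Q, angles))
```

Under these assumptions there are 5 x 3 x 3 x 16 = 720 size and orientation combinations for every candidate 3D location.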
Layout Detection. $M = (L, \theta, S)$. Trained using the same SSVM method, with the free-space definition of IOU as the loss, where the ground truth is the hypothesis with the largest free-space IOU.
Layout Hypothesis. Layout hypotheses must capture 80% of candidate points. Floors and ceilings are predicted at the 0.001 and 0.999 quantiles of the 3D points (along the gravity direction). 5,000-20,000 hypotheses for a typical scene.
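The floor and ceiling rule and the 80% coverage test can be sketched as follows. The linear-interpolation quantile and the function names are assumptions of this sketch; heights are taken to be point coordinates along the gravity direction.

```python
def quantile(values, q):
    """Quantile by linear interpolation between sorted samples."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def floor_and_ceiling(heights):
    """Floor and ceiling heights at the 0.001 and 0.999 quantiles."""
    return quantile(heights, 0.001), quantile(heights, 0.999)

def captures_enough(num_inside, num_total, min_fraction=0.8):
    """True if a layout hypothesis contains enough of the candidate points."""
    return num_inside >= min_fraction * num_total
```

Using extreme quantiles rather than the minimum and maximum keeps a handful of noisy depth points from moving the floor or ceiling estimate.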
Learning Spatial Context. Problem: a portion of a large object is detected as a smaller object. Solution: cascaded classification.
Evaluation
Experiment Setup. Dataset: SUN RGB-D. Parameters: compared with Sliding Shapes, baseline layout, and HOG; 10 object categories. Performance metrics: cuboid performance is evaluated using IOU with ground truth cuboids; layout performance is evaluated using free-space IOU with human annotations.
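For reference, intersection over union between two boxes can be computed as in the sketch below. It assumes axis-aligned boxes given as (min corner, max corner) pairs, which is a simplification: the cuboids in this work carry an orientation, and handling rotated boxes needs a polygon intersection step that is not shown. The function names are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box ((x0, y0, z0), (x1, y1, z1))."""
    lo, hi = box
    vol = 1.0
    for axis in range(3):
        vol *= max(0.0, hi[axis] - lo[axis])
    return vol

def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[0][axis], b[0][axis])
        hi = min(a[1][axis], b[1][axis])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union
```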
Experiment Results
Experiment Results Precision scores for 10 object categories
Summary. Novel representations: clouds of oriented gradients (COG) for cuboids; Manhattan voxels for layouts; uses RGB-D data and does not rely on CAD model information. Learning: objects are classified using SSVM; a cascaded learning framework is applied to remove false positives.
Q&A
Backup
Cascaded Classification. First-stage detections become input features to second-stage classifiers that estimate confidence. Essentially a directed graphical model with hidden variables. Marginalizing the first-stage variables recovers a standard, fully connected undirected graph. More efficient: training decomposes into independent learning problems for each node (object category), and optimal test classification is possible via a rapid sequence of local decisions.
Cascaded Classification. First stage: outputs the layout and a set of {bounding box, confidence score, object category}. Second stage: adds contextual features. Object-object overlap. Object-layout context: distance and angle to the nearest wall.
Learning Spatial Context. Training: standard SVM with radial basis function (RBF) kernel; binary classification: true or false positive. Prediction: the second-stage classifier outputs a new contextual confidence; the overall confidence is the sum of the first and second stages.
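The prediction rule above, plus the kernel the second-stage SVM relies on, can be sketched in a few lines. The dictionary keys `first` and `second`, the function names, and the default `gamma` are hypothetical; training the SVM itself is not shown.

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel exp(-gamma * ||u - v||^2); gamma is a free parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def rescore(detections):
    """Add first- and second-stage confidences and sort best-first.

    Each detection is a dict with hypothetical keys 'first' and 'second'.
    """
    rescored = [dict(d, overall=d["first"] + d["second"]) for d in detections]
    return sorted(rescored, key=lambda d: d["overall"], reverse=True)
```

A detection with a strong first-stage score but poor context, such as a fragment of a bed detected as a chair, drops in the ranking once the contextual confidence is added.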