Allan Zelener, Dissertation Proposal, December 12th, 2016
Object Localization, Segmentation, Classification, and Pose Estimation in 3D Images using Deep Learning
Overview
1. Introduction to 3D Object Identification
2. Completed Work
- Part-based Object Classification of Vehicle Point Clouds.
- CNN-based Object Segmentation in LIDAR with Missing Points.
3. Proposed Work
- Joint localization, segmentation, classification, and 3D pose estimation.
- Depth-sensitive localization.
- Depth-sensitive subpixel methods for segmentation.
- Spatial transformers for pose estimation.
- Domain adaptation and shape completion from synthetic data.
- Timeline for completion.
Identifying 3D Objects
Real-world objects have a 3D shape and a position in a 3D scene. Objects may be oriented with respect to some reference pose. These object properties are associated with their semantic class.
Identifying Objects in 2D Images
Source: Fei-Fei, Karpathy, and Johnson (http://cs231n.stanford.edu/slides/winter1516_lecture13.pdf)
Identifying 3D Objects in 2D Images
- 3D oriented CAD models are mapped to 2D image regions.
- Approximate 3D shape is based on the selected models.
- Relative 3D position and scale may still be ambiguous.
- Visual perspective cues are required to estimate these object properties.
Source: Xiang et al., ObjectNet3D: A Large Scale Database for 3D Object Recognition
Identifying 3D Objects in 3D Images
- 3D sensors provide accurate pointwise depth measurements.
- Object position and scale can be determined from a single 3D image.
Source: Song et al., SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite
Challenges in 3D Images
- Missing measurements due to sensor properties.
- Partial 3D data based on limited viewpoints.
- Difficult large-scale annotation compared to 2D images.
- Feature representations for 3D properties.
[Figure: Manual Labeling of a 3D Point Cloud]
Completed Work
Classification of Vehicle Parts in Unstructured 3D Point Clouds
- RANSAC point clustering for planar parts.
- Part-based structured model for classifying parts and overall object class.
Zelener, Mordohai, and Stamos, "Classification of Vehicle Parts in Unstructured 3D Point Clouds," 3DV 2014.
Local Feature Extraction
- Density-weighted spin images.
- Dense sampling of keypoints on a uniformly spaced voxel grid.
- Normals oriented away from the object centroid.
- K-means clustering to generate a bag-of-words codebook (see the sketch after this slide).
- Baseline object descriptor is the normalized count vector of codebook features.
[Figure: K-Means Spin Image Codebook, k = 50]
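To make the bag-of-words step concrete, here is a minimal sketch of codebook construction and the baseline object descriptor. It assumes spin images are already computed as flattened vectors; the function names and the use of scikit-learn's KMeans are illustrative, not the dissertation's implementation.

```python
# Hedged sketch: k-means codebook over spin-image descriptors (k = 50 per
# the slide) and the normalized bag-of-words count vector per object.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(spin_images, k=50):
    """Cluster local spin-image vectors into a k-word codebook."""
    return KMeans(n_clusters=k, n_init=10).fit(spin_images)

def bow_descriptor(codebook, object_spin_images):
    """Normalized histogram of codeword assignments for one object."""
    words = codebook.predict(object_spin_images)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```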
Automatic Part Segmentation
- Iterative RANSAC plane fitting (see the sketch after this slide).
- Candidate planes from faces of the convex hull.
- Robust re-estimation of planes using PCA.
- For vehicles, five planar parts cover most of the surface.
[Figure: Convex Hull Examples, Colored by Segmentation Order]
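A minimal sketch of the iterative plane extraction, assuming a point cloud given as an (N, 3) array. Note the slide's candidate planes come from convex-hull faces; for brevity this sketch samples random point triples instead, and the threshold and iteration counts are illustrative.

```python
import numpy as np

def fit_plane_pca(points):
    """Least-squares plane via PCA; returns (unit normal, centroid)."""
    c = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - c, full_matrices=False)
    return vt[-1], c  # normal = direction of smallest variance

def iterative_ransac_planes(points, n_parts=5, thresh=0.05, iters=500):
    """Greedily extract planar parts; five parts suffice for vehicles."""
    remaining, parts = points, []
    for _ in range(n_parts):
        best = None
        for _ in range(iters):
            idx = np.random.choice(len(remaining), 3, replace=False)
            n, c = fit_plane_pca(remaining[idx])
            inliers = np.abs((remaining - c) @ n) < thresh
            if best is None or inliers.sum() > best.sum():
                best = inliers
        n, c = fit_plane_pca(remaining[best])  # robust PCA re-estimation
        parts.append((n, c, remaining[best]))
        remaining = remaining[~best]
    return parts
```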
Part-Level Features
- Spin image bag-of-words.
- Average height h̄.
- Horizontal/vertical indicator: I(n) = 0 if nᵀẑ > cos(π/4), and 1 otherwise.
- Mean, median, and max of plane fit errors.
- Eigenvalues from plane fitting: λ₁, λ₂, λ₃ (in descending order).
- Linearity (λ₁ − λ₂) and planarity (λ₂ − λ₃).
Pairwise Part Features
- Dot product of normals: n₁ᵀn₂.
- Absolute difference in average heights: |h̄₁ − h̄₂|.
- Distance between centroids: ‖c₁ − c₂‖.
- Closest distance between points: min over i ∈ P₁, j ∈ P₂ of ‖p₁,ᵢ − p₂,ⱼ‖.
- Coplanarity as mean, median, and max cross-plane fit errors.
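A sketch of computing the first four pairwise features with NumPy, assuming each part provides its unit normal n, average height h̄, centroid c, and point set P; the function name and calling convention are illustrative.

```python
import numpy as np

def pairwise_part_features(n1, n2, h1, h2, c1, c2, P1, P2):
    """Geometric pairwise features between two planar parts."""
    dot_normals = float(n1 @ n2)                    # n1^T n2
    height_diff = abs(h1 - h2)                      # |h̄1 - h̄2|
    centroid_dist = float(np.linalg.norm(c1 - c2))  # ||c1 - c2||
    # Closest distance between the two point sets (brute force).
    closest = np.min(np.linalg.norm(P1[:, None, :] - P2[None, :, :], axis=2))
    return dot_normals, height_diff, centroid_dist, float(closest)
```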
Structured Part Modeling
- Generalized HMM as a sequence of parts and a final class variable.
- Trained discriminatively by a structured averaged perceptron (see the sketch after this slide).
- Parts reordered in sequence based on I(n) and average height.
[Figure: Graphical model with part labels a₁, ..., aₙ, class variable c, and observed part features x₁, ..., xₙ]
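A minimal sketch of the structured averaged perceptron update, assuming a joint feature map phi(x, y) and an inference routine argmax_y that decodes the best part/class sequence under the current weights; both are placeholders standing in for the model's actual Viterbi-style decoding.

```python
import numpy as np

def train_structured_averaged_perceptron(data, phi, argmax_y, dim, epochs=10):
    """data: list of (x, y_gold) pairs; returns the averaged weight vector."""
    w, w_sum, t = np.zeros(dim), np.zeros(dim), 0
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = argmax_y(w, x)  # decode best structure under w
            if y_pred != y_gold:     # standard structured perceptron update
                w += phi(x, y_gold) - phi(x, y_pred)
            w_sum += w               # accumulate for averaging
            t += 1
    return w_sum / t                 # averaged weights reduce variance
```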
Experimental Results for Part Classification
- Evaluation on the Ottawa dataset with 155 sedans and 67 SUVs.
- Structured part modeling improves part classification performance.
- Manual segmentation improves classification of all parts per object.
[Figure: Part Classification Comparison]
Experimental Results for Object Classification
- The structured perceptron (SP) gives significant gains over the baseline perceptron model.
- Manual segmentation with SP exceeds unstructured baselines.
[Figure: Sedan vs. SUV Object Classification, With and Without Part Segmentation]
Comparison Between Automatic and Manual Segmentation
- Under-segmentation from unbounded plane fitting.
- Merged semantic part classes like roof-hood and roof-trunk.
- Inconsistent labeling behavior at boundaries and noisy points.
[Figure: Automatic vs. Manual Segmentation]
Conclusions for Part-based Classification
PROS
- RANSAC segmentation is robust to many complexities of 3D data.
- The structured part-based method improves over bag-of-words with local features.
- Pairwise features based on geometric properties improve classification performance.
CONS
- RANSAC segmentation is not equivalent to semantic segmentation.
- Labeling ground truth parts for every possible object class may be infeasible.
- RANSAC segmentation, features, and the structured model are determined before training the classifier.
CNN-Based Object Segmentation
- Segmentation on the LIDAR scanning grid with missing points.
- CNN training procedure for LIDAR data.
- CNN-based features extracted from a small set of initial feature maps for 3D images.
Zelener and Stamos, "CNN-Based Object Segmentation in Urban LIDAR with Missing Points," 3DV 2016.
Missing Points in LIDAR
- Contiguous LIDAR scanlines form a 2.5D grid of scanner measurements.
- Laser reflection causes missing points on objects in the grid.
- We can label and infer over these positions.
[Figures: Missing Points in Gray on the Scanning Grid; Missing Points on Vehicles are Labeled]
Preprocessing Pipeline
- Sample positive and negative locations in a large LIDAR scene piece.
- Extract an M × M patch as input to the CNN.
- Predict labels for the central K × K region, K ≪ M (M = 64, K = 8).
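A sketch of the patch extraction step, assuming the feature maps and labels are aligned 2D arrays on the scanning grid; boundary handling near the grid edges is omitted and the function name is illustrative.

```python
import numpy as np

def extract_patch(features, labels, r, c, M=64, K=8):
    """Crop an M x M input patch and its central K x K label target
    around grid location (r, c). features: (H, W, C) array; labels: (H, W)."""
    patch = features[r - M // 2 : r + M // 2, c - M // 2 : c + M // 2]
    target = labels[r - K // 2 : r + K // 2, c - K // 2 : c + K // 2]
    return patch, target
```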
Initial Feature Maps
- Compute normalized feature maps from the 3D points in the M × M patch.
- Values are assumed ~N(0, 1), truncated to [−6, 6] within each patch.
- Missing data is given the max value (6) in the clip range.
[Figure: Relative Depth and Relative Height feature maps, color scale −6 to 6]
Initial Feature Maps
- Angle and missing mask describe sensor properties.
- Angle normalized as before; missing mask in {0, 1}.
[Figure: Angle (color scale −6 to 6) and Missing Mask (0 or 1) feature maps]
Initial Feature Maps
Signed Angle from Hadjiliadis and Stamos, 3DPVT 2010. For consecutive scanline points p₁, p₂, p₃ with v₁ = p₂ − p₁ and v₂ = p₃ − p₂:
SignedAngle(p₂) = acos(ẑᵀv̂₂) · sgn(v₁ᵀv₂)
- Horizontal surfaces at 90 degrees.
- Vertical surfaces at 0 degrees.
- Sharp changes yield a negative sign (see the sketch after this slide).
[Figure: Scanline geometry with points p₁, p₂, p₃, vectors v₁, v₂, and the scanning direction]
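A sketch of the per-patch normalization and one plausible reading of the signed-angle formula. The exact conventions (e.g., defining v₁ and v₂ from consecutive scanline points as above) are assumptions reconstructed from the slide, not a verified transcription of the cited paper.

```python
import numpy as np

def normalize_map(x, missing, clip=6.0):
    """Standardize valid values within a patch, truncate to [-6, 6],
    and assign missing positions the max value in the clip range."""
    valid = ~missing
    z = np.zeros_like(x, dtype=float)
    z[valid] = (x[valid] - x[valid].mean()) / (x[valid].std() + 1e-8)
    z = np.clip(z, -clip, clip)
    z[missing] = clip
    return z

def signed_angle(p1, p2, p3, z_hat=np.array([0.0, 0.0, 1.0])):
    """Signed angle (degrees) at p2 from consecutive scanline points.
    Horizontal surfaces give ~90, vertical ~0; a sharp reversal of the
    scan direction (v1 . v2 < 0) flips the sign."""
    v1, v2 = p2 - p1, p3 - p2
    v2_hat = v2 / (np.linalg.norm(v2) + 1e-8)
    angle = np.degrees(np.arccos(np.clip(z_hat @ v2_hat, -1.0, 1.0)))
    return angle * np.sign(v1 @ v2)
```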
Model Overview
- Baseline CNN architecture.
- ReLU nonlinear activation functions.
- L2-regularization on affine layers.
- Dropout regularization on the final layer.
- Predict a binary label for each point in the K × K target.
Total model loss (binary cross-entropy plus L2-regularization):
L(x, y) = −Σ_{k=1..K²} [y_k log p_k + (1 − y_k) log(1 − p_k)] + λ Σ_{l=1..L} ‖W_l‖₂²
Architecture: Input patch (64, 64, 5) → Conv 5×5 → (32, 32, 32) → Conv 5×5 → (16, 16, 64) → Affine → 512 → Affine → 64 (= K²) output labels.
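A hedged sketch of this baseline in PyTorch (the original work predates PyTorch). Layer sizes follow the slide; the padding/pooling choices, dropout rate, and weight decay value are assumptions, and note that weight_decay applies L2 to all parameters rather than only the affine layers.

```python
import torch
import torch.nn as nn

class BaselineCNN(nn.Module):
    """(64, 64, 5) input -> Conv 5x5 -> (32, 32, 32) -> Conv 5x5 ->
    (16, 16, 64) -> Affine 512 -> Affine 64 (= K^2) output logits."""
    def __init__(self, in_channels=5, K=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 16 * 64, 512), nn.ReLU(),
            nn.Dropout(0.5),                  # dropout on the final layer
            nn.Linear(512, K * K),
        )

    def forward(self, x):                     # x: (N, 5, 64, 64)
        return self.classifier(self.features(x))

model = BaselineCNN()
loss_fn = nn.BCEWithLogitsLoss()              # binary cross-entropy term
opt = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)  # λ‖W‖²
```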
Results from Vehicle Point Detection using CNN (patch size = 64 × 64, target size = 8 × 8)
[Figures: nyc_0 (in-sample) test piece and nyc_1 test piece. True positives in yellow, true negatives in dark blue, false positives in cyan, false negatives in orange.]
[Figure: nyc_0 (in-sample) test piece. Recall 0.85, precision 0.73. True positives in yellow, true negatives in dark blue, false positives in cyan, false negatives in orange.]
[Figure: nyc_1 test piece. Recall 0.85, precision 0.73. Same color legend as above.]
Experimental Results: Input Feature Map Comparison
[Figure: comparison across input feature map combinations. Legend: D = Depth, H = Height, A = Angle, S = Signed Angle, M = Missing Mask.]
Impact of Using Missing Point Labels
- Training with missing point labels improves precision.
- Missing point labels allow for complete segmentation.
[Figures: DHASM with missing point labels vs. DHASM with no missing point labels]
Experimental Results: Use of Missing Point Labels
[Figure: comparison with and without missing point labels. NML = No Missing Labels.]
Conclusions for CNN-Based Segmentation
- A CNN for LIDAR learned using a sampling-based training pipeline.
- We can predict class labels over missing points in LIDAR.
- Incorporating missing points improves precision.
- Input feature maps that describe 3D shape and sensor properties have a significant effect on performance.
Proposed Work
- Extend the CNN model to multiclass object localization, segmentation, classification, and pose estimation in 3D images.
- Examine the design and structure of CNN components for 3D images:
  - Depth-sensitive localization.
  - Depth-subpixel methods for segmentation.
  - Spatial transformers for pose estimation.
- Utilize domain adaptation from synthetic data for auxiliary training data and missing point reconstruction.
Novelty of Proposed Work
- Multi-task model for all tasks.
  - Previous models address at most three of the proposed tasks.
  - Addition of 3D object pose estimation.
- Improve performance on all tasks by adapting current state-of-the-art techniques to the domain of 3D objects.
  - Balance between 2.5D image and 3D voxel representations.
- Incorporation of additional datasets.
  - Comparison across urban LIDAR and indoor RGB-D domains.
  - Missing point estimation from synthetic data or multi-view reconstruction.
  - Domain adaptation from synthetic datasets.
2D Object Localization in LIDAR (In Progress)
- Preliminary results at a 0.8 confidence threshold.
- Based on the YOLO single-shot architecture.
- Can be used for region proposals or extended to 3D bounding boxes.
Google Street View Dataset: Ground Truth Pose Labeling
- Automatic fit of bounding boxes.
- PCA to fit non-axis-aligned boxes.
- Manual tool to (a) select the front face (shown in a different color) for orientation (the default is selected automatically), and (b) change the size/position/orientation of boxes in the case of incomplete objects.
Multi-task Model for Object Identification
- A shared representation is applicable across multiple tasks.
- Tasks: object localization, segmentation, classification, and pose estimation.
- The error signal for each task trains the weights of the shared representation.
Source: Dai et al., Instance-aware Semantic Segmentation via Multi-task Network Cascades
Multi-task Model for Object Identification
- Straightforward extension to orientation estimation.
- Assume objects are upright; estimate rotation about the gravity axis.
Source: Dai et al., Instance-aware Semantic Segmentation via Multi-task Network Cascades
Localization for 3D Objects in Voxel Space
- 3D voxel input representation (truncated signed distance function, TSDF).
- The voxel gives relative position; the anchor box gives a shape prior.
- The network estimates adjustments to the box position and dimensions.
Source: Song and Xiao, Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images
Depth-Sensitive Localization
- We aim to maintain a non-volumetric 2.5D input representation.
- Partition the viewing volume and consider localization in depth slices z₁, ..., z₄.
- A 2D convolution over the (W, H, F_in) 2.5D input predicts, for each grid cell, depth slice, and anchor (a_x, a_y, a_z), six box parameters (output shape (X, Y, Z · A · 6)):
b_x = x_i + dx,  b_y = y_i + dy,  b_z = z_i + dz
b_width = a_x · s_x,  b_height = a_y · s_y,  b_depth = a_z · s_z
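A sketch of decoding one predicted box against a depth-slice anchor, following the slide's parameterization (additive offsets for position, multiplicative scales for size); a log-space exp() on the scales, as in Faster R-CNN, would be a common alternative. The function name is illustrative.

```python
def decode_box(anchor, deltas):
    """anchor = (x_i, y_i, z_i, a_x, a_y, a_z): slice position and anchor shape.
    deltas = (dx, dy, dz, s_x, s_y, s_z): the six predicted box parameters."""
    x_i, y_i, z_i, a_x, a_y, a_z = anchor
    dx, dy, dz, s_x, s_y, s_z = deltas
    center = (x_i + dx, y_i + dy, z_i + dz)   # b_x, b_y, b_z
    size = (a_x * s_x, a_y * s_y, a_z * s_z)  # b_width, b_height, b_depth
    return center, size
```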
Subpixel Convolutions
- Pooled CNN features can still encode higher-resolution information.
- Upscale back through deconvolution or subpixel convolution.
- Used in state-of-the-art segmentation networks.
[Figure: padded image, zero-padded sub-pixel image, subpixel filter, and filter activations]
Source: Shi et al., Is the deconvolution layer the same as a convolutional layer?
Subpixel Convolutions
- Independent subpixel filter weights can be separated.
- All convolutions run at low resolution; outputs are interleaved to upsample at the end of the network (see the sketch after this slide).
[Figure: padded image, separate filters, filter activations, and combined filter activations]
Source: Shi et al., Is the deconvolution layer the same as a convolutional layer?
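A minimal PyTorch sketch of the idea: convolutions stay at low resolution and produce r² channel groups per output pixel, which PixelShuffle interleaves into an r×-upsampled map. The channel counts and upscale factor are illustrative.

```python
import torch
import torch.nn as nn

r = 2  # upscale factor
net = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1 * r * r, 3, padding=1),  # r*r subpixel filter outputs
    nn.PixelShuffle(r),  # interleave: (N, r*r, H, W) -> (N, 1, r*H, r*W)
)

x = torch.randn(1, 64, 16, 16)
print(net(x).shape)  # torch.Size([1, 1, 32, 32])
```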
Position-Sensitive Score Maps
- Subpixel-like features can be specialized for a given task.
Source: Dai et al., R-FCN: Object Detection via Region-based Fully Convolutional Networks
Depth-Sensitive Score Maps
- We can extend this approach to be depth-sensitive.
- A conv layer produces k³(C + 1) score maps, one per depth-sensitive position (top-left-back, top-left-center, ..., bottom-right-center, bottom-right-front) per class.
- Pooling over the k × k × k position bins and voting yields the C + 1 class scores.
Spatial Transformers for Pose Estimation
- General method for parameterized transforms between feature maps.
- Interpolation over a transformed sampling grid.
- The estimated transformation is related to the 3D object pose.
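A sketch of the core spatial transformer operation in PyTorch: a parameterized affine transform generates a sampling grid, and the feature map is bilinearly interpolated at those locations. In the proposed use, the predicted rotation about the gravity axis would relate to object pose; the example transform below is illustrative.

```python
import torch
import torch.nn.functional as F

def spatial_transform(feat, theta):
    """Resample feat (N, C, H, W) under per-example 2x3 affine transforms."""
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

# Example: a 45-degree rotation of the sampling grid.
c, s = 0.7071, 0.7071
theta = torch.tensor([[[c, -s, 0.0], [s, c, 0.0]]])
out = spatial_transform(torch.randn(1, 8, 32, 32), theta)  # (1, 8, 32, 32)
```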
Complete Model Sketch
[Figure: 2.5D image → conv → feature maps → down convs → shared feature maps → multi-scale depth-sensitive localization → ROI pooling and spatial transformer → depth-sensitive segmentation, classification, and pose estimation]
Timeline for Completion
December 2016
- Select and prepare new datasets for experiments.
- Annotate the Street View dataset with object bounding boxes.
- Extend current localization and segmentation implementations for baselines.
- Begin implementation of classification and pose estimation baselines.
January 2017
- Complete implementation of baseline models and begin training models for evaluation on a chosen dataset.
- Implement the baseline multi-task model.
Timeline for Completion
February 2017
- Begin experiments with architectures using:
  - Depth-sensitive localization.
  - Depth-sensitive subpixel convolution for segmentation.
  - 3D object pose estimation with spatial transformers.
March 2017
- Prepare a paper for ICCV 2017 submission including experiments on:
  - Multi-task learning for 3D object identification.
  - One of the proposed depth-sensitive experimental architectures.
- Consider additional experiments on domain adaptation and missing point reconstruction.
Timeline for Completion
April 2017
- Dissertation writing.
- Continuation of experiments.
May 2017
- Dissertation defense.
- Prepare a paper submission to 3DV 2017 containing additional experiments.
Additional Slides
Google Street View Dataset
- Google R5 Street View dataset.
- All but two pieces of NYC 0 used for training.
- The remaining runs used for evaluation.
KITTI Dataset
- 3D bounding boxes for vehicles, cyclists, and pedestrians in LIDAR.
- Precise segmentation labels are not included in the benchmark.
SYNTHIA Dataset
- Synthetic urban scenes for simulated RGB-D scans.
- Exact labels for semantic segmentation, but 3D poses are not given.
- Domain adaptation required for effective use on real-world data.
- The missing point reconstruction task can be simulated.
Indoor RGB-D Datasets
- SUN RGB-D and SceneNN.
- Class, segmentation, and oriented 3D bounding boxes included.
- Reconstructed shape can be used for missing points.
Assumptions for Proposed Work
- Single 3D image from a LIDAR sensor sweep or RGB-D camera.
  - Excludes video, multiview registration, and volumetric sensors.
- Possible shape completion only for missing (non-occluded) scan points.
  - Excludes complete volumetric shape reconstruction and database matching.
Sources: Hua et al., SceneNN: A Scene Meshes Dataset with annotations; Wu et al., 3D ShapeNets: A Deep Representation for Volumetric Shapes