Object Localization, Segmentation, Classification, and Pose Estimation in 3D Images using Deep Learning


Allan Zelener, Dissertation Proposal, December 12th, 2016: Object Localization, Segmentation, Classification, and Pose Estimation in 3D Images using Deep Learning

Overview
1. Introduction to 3D Object Identification
2. Completed Work
   Part-based Object Classification of Vehicle Point Clouds.
   CNN-based Object Segmentation in LIDAR with Missing Points.
3. Proposed Work
   Joint localization, segmentation, classification, and 3D pose estimation.
   Depth-sensitive localization.
   Depth-sensitive subpixel methods for segmentation.
   Spatial transformers for pose estimation.
   Domain adaptation and shape completion from synthetic data.
   Timeline for completion.

Identifying 3D Objects Real-world objects have a 3D shape and a position in a 3D scene. Objects may be oriented with respect to some reference pose. These object properties are associated with the object's semantic class.

Identifying 3D Objects

Identifying Objects in 2D Images Fei-Fei, Karpathy, Johnson (http://cs231n.stanford.edu/slides/winter1516_lecture13.pdf)

Identifying 3D Objects in 2D Images 3D oriented CAD models mapped to 2D image regions. Approximate 3D shape based on selected models. Relative 3D position and scale may still be ambiguous. Visual perspective cues are required to estimate object properties. Xiang et al., ObjectNet3D: A Large Scale Database for 3D Object Recognition

Identifying 3D Objects in 3D Images 3D sensors provide accurate pointwise depth measurements. Object position and scale can be determined from a single 3D image. Song et al., SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite

Challenges in 3D Images Missing measurements due to sensor properties. Partial 3D data based on limited viewpoints. Difficult large-scale annotation compared to 2D images. Feature representations for 3D properties. [Figure: manual labeling of a 3D point cloud.]

Completed Work Classification of Vehicle Parts in Unstructured 3D Point Clouds RANSAC point clustering for planar parts. Part-based structured model for classifying parts and overall object class. Classification of Vehicle Parts in Unstructured 3D Point Clouds, Zelener, Mordohai, and Stamos, 3DV, 2014.

Local Feature Extraction Density-weighted spin images. Dense sampling of keypoints on a uniformly spaced voxel grid. Normals oriented away from the object centroid. K-means clustering to generate a bag-of-words codebook. The baseline object descriptor is the normalized count vector of codebook features. [Figure: k-means spin-image codebook, k = 50.]
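A minimal sketch of the codebook construction, assuming scikit-learn's KMeans and a placeholder descriptor dimension (the spin-image computation itself is omitted):

    import numpy as np
    from sklearn.cluster import KMeans

    # Train the k = 50 codebook on spin images pooled over the training set.
    # The random array is a stand-in for real density-weighted spin images.
    train_descriptors = np.random.rand(10000, 153)
    codebook = KMeans(n_clusters=50, n_init=10).fit(train_descriptors)

    def bow_descriptor(spin_images, codebook):
        """Normalized count vector of codebook features for one object."""
        words = codebook.predict(spin_images)  # nearest codeword per keypoint
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        return hist / hist.sum()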

Automatic Part Segmentation Iterative RANSAC plane fitting. Candidate planes from faces of the convex hull. Robust re-estimation of planes using PCA. For vehicles, five planar parts cover most of the surface. [Figure: convex hull examples, colored by segmentation order.]
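A minimal sketch of one plane extraction, with random minimal samples standing in for the convex-hull face candidates described above and an illustrative inlier threshold:

    import numpy as np

    def fit_plane_pca(points):
        """Least-squares plane via PCA: returns (unit normal, centroid)."""
        centroid = points.mean(axis=0)
        _, _, vt = np.linalg.svd(points - centroid)
        return vt[-1], centroid  # direction of smallest variance

    def ransac_plane(points, n_iters=200, threshold=0.05):
        rng = np.random.default_rng(0)
        best = None
        for _ in range(n_iters):
            sample = points[rng.choice(len(points), size=3, replace=False)]
            normal, c = fit_plane_pca(sample)
            inliers = np.abs((points - c) @ normal) < threshold
            if best is None or inliers.sum() > best.sum():
                best = inliers
        # Robust re-estimation of the plane with PCA, as on the slide.
        return fit_plane_pca(points[best]), best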

Part-Level Features Spin image bag-of-words. Average height h̄. Horizontal/vertical indicator I(n) = 0 if nᵀẑ > cos(π/4), 1 otherwise. Mean, median, and max of plane fit errors. Eigenvalues from plane fitting λ1, λ2, λ3 (in descending order). Linearity (λ1 − λ2) and planarity (λ2 − λ3).
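A sketch of the eigenvalue-derived features and the indicator; reading the comparison as nᵀẑ > cos(π/4) on the unit normal is an assumption:

    import numpy as np

    def eigen_features(points):
        """Plane-fit eigenvalues λ1 ≥ λ2 ≥ λ3 plus linearity and planarity."""
        cov = np.cov((points - points.mean(axis=0)).T)
        lam = np.sort(np.linalg.eigvalsh(cov))[::-1]
        return lam, lam[0] - lam[1], lam[1] - lam[2]

    def horizontal_vertical_indicator(normal, z=np.array([0.0, 0.0, 1.0])):
        """I(n) = 0 for roughly horizontal parts, 1 otherwise."""
        return 0 if normal @ z > np.cos(np.pi / 4) else 1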

Pairwise Part Features Dot product of normals, n1ᵀn2. Absolute difference in average heights, |h̄1 − h̄2|. Distance between centroids, ‖c1 − c2‖. Closest distance between points, min over i ∈ P1, j ∈ P2 of ‖p1,i − p2,j‖. Coplanarity as mean, median, and max cross-plane fit errors.
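A sketch of these pairwise features for two parts given as point arrays with unit normals; the coplanarity statistics are omitted:

    import numpy as np
    from scipy.spatial.distance import cdist

    def pairwise_features(P1, P2, n1, n2):
        """Geometric features between two segmented parts.
        P1, P2: (N, 3) point arrays with the z column taken as height."""
        return {
            "normal_dot": float(n1 @ n2),
            "height_diff": abs(P1[:, 2].mean() - P2[:, 2].mean()),
            "centroid_dist": float(np.linalg.norm(P1.mean(0) - P2.mean(0))),
            "closest_point_dist": float(cdist(P1, P2).min()),
        }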

Structured Part Modeling Generalized HMM as a sequence of parts and a final class variable. Trained discriminatively by a structured averaged perceptron. Parts reordered in sequence based on I(n) and average height. [Diagram: chain of part labels a1, a2, …, an feeding the class variable c, with observed part features x1, x2, …, xn.]
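A minimal sketch of one structured perceptron update, with a hypothetical joint feature map phi and decoder decode standing in for inference over the part sequence:

    def perceptron_step(w, x, y_true, phi, decode):
        """One structured perceptron update: move the weight vector toward
        the gold structure and away from the current best prediction."""
        y_pred = decode(w, x)  # argmax_y w . phi(x, y), e.g. Viterbi over parts
        if y_pred != y_true:
            w = w + phi(x, y_true) - phi(x, y_pred)
        return w
    # Averaging: keep a running sum of w over all updates and use its mean.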

Experimental Results for Part Classification Evaluation on the Ottawa dataset with 155 sedans and 67 SUVs. Structured part modeling provides increased performance for part classification. Manual segmentation provides a further increase when classifying all parts per object. [Chart: part classification comparison.]

Experimental Results for Object Classification The structured perceptron (SP) gives significant gains over the baseline perceptron model. Manual segmentation with SP exceeds the unstructured baselines. [Chart: sedan vs. SUV object classification, with and without part segmentation.]

Comparison Between Automatic and Manual Segmentation Under-segmentation from unbounded plane fitting. Merged semantic part classes like roof-hood and roof-trunk. Inconsistent labeling behavior at boundaries and noisy points. [Figure: automatic vs. manual segmentation examples.]

Conclusions for Part-based Classification PROS: RANSAC segmentation is robust to many complexities of 3D data. The structured part-based method shows improvement over bag-of-words with local features. Pairwise features based on geometric properties improve classification performance. CONS: RANSAC segmentation is not equivalent to semantic segmentation. Labeling ground-truth parts for every possible object class may be infeasible. RANSAC segmentation, features, and the structure model are determined before training the classifier.

CNN-Based Object Segmentation Segmentation on the LIDAR scanning grid with missing points. CNN training procedure for LIDAR data. CNN-based features extracted from a small set of initial feature maps for 3D images. CNN-Based Object Segmentation in Urban LIDAR with Missing Points, Zelener and Stamos, 3DV, 2016.

Missing Points in LIDAR Contiguous LIDAR scanlines form a 2.5D grid of scanner measurements. Laser reflection causes missing points on objects in the grid. We can label and infer over these positions. [Figures: missing points shown in gray on the scanning grid; missing points on vehicles are labeled.]

Preprocessing Pipeline Sample positive and negative locations in a large LIDAR scene piece. Extract an M × M patch as input to the CNN. Predict labels for the central K × K region, K ≪ M (M = 64, K = 8).
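A sketch of this sampling step, assuming grid and labels are NumPy arrays indexed by scanline row and column:

    def sample_example(grid, labels, cy, cx, M=64, K=8):
        """Crop the M x M input patch and its central K x K label target
        around a sampled location (cy, cx) on the LIDAR scanning grid."""
        m, k = M // 2, K // 2
        patch = grid[cy - m: cy + m, cx - m: cx + m]     # (M, M, feature maps)
        target = labels[cy - k: cy + k, cx - k: cx + k]  # (K, K)
        return patch, target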

Initial Feature Maps Compute normalized feature maps from the 3D points in the M × M patch. Assume ~N(0, 1), truncated to [−6, 6] within each patch. Missing data is given the max value (6) in the clip range. [Figure: relative depth and relative height maps, color scale −6 to 6.]
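A sketch of this normalization for a single feature map (the epsilon guard is an addition for numerical safety):

    import numpy as np

    def normalize_map(raw, missing):
        """Standardize valid values to ~N(0,1), clip to [-6, 6], and place
        missing points at the top of the clip range (6)."""
        valid = raw[~missing]
        z = (raw - valid.mean()) / (valid.std() + 1e-8)
        z = np.clip(z, -6.0, 6.0)
        z[missing] = 6.0
        return z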

Initial Feature Maps Angle and missing mask describe sensor properties. Angle normalized as before; missing mask in {0, 1}. [Figure: angle map (color scale −6 to 6) and missing mask (0/1).]

Initial Feature Maps Signed angle from Hadjiliadis and Stamos, 3DPVT 2010: SignedAngle(p2) = acos(ẑ · v̂2) · sgn(v1 · v2), with v1 = p2 − p1 and v2 = p3 − p2 taken along the scanning direction. Horizontal surfaces are at 90 degrees, vertical surfaces at 0 degrees, and sharp changes yield a negative sign. [Diagram: scanline points p1, p2, p3 with direction vectors v1, v2 against the vertical axis ẑ.]
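A sketch of the signed angle at p2 from its scanline neighbors; taking the sign from the dot product of consecutive direction vectors is an assumption consistent with the description above:

    import numpy as np

    def signed_angle(p1, p2, p3, z=np.array([0.0, 0.0, 1.0])):
        """Angle of the outgoing scan direction against vertical, with a
        negative sign at sharp direction changes."""
        v1, v2 = p2 - p1, p3 - p2
        v2_hat = v2 / np.linalg.norm(v2)
        angle = np.arccos(np.clip(z @ v2_hat, -1.0, 1.0))
        return angle * np.sign(v1 @ v2)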

Model Overview Baseline CNN architecture. ReLU nonlinear activation functions. L2-regularization on affine layers. Dropout regularization on the final layer. Predict a binary label for each point in the K × K target. Total model loss:

L(x, y) = −Σ_{k=1}^{K²} [ y_k log p_k + (1 − y_k) log(1 − p_k) ] + λ Σ_{l=1}^{L} ‖W_l‖₂²

(binary cross-entropy plus L2-regularization). Architecture: Input Patch (64, 64, 5) → Conv 5×5 (32, 32, 32) → Conv 5×5 (16, 16, 64) → Affine 512 → Affine 64 (= K²) → Output Labels.
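A NumPy sketch of this loss, given predicted probabilities p and binary labels y over the K × K target, plus the affine-layer weight matrices (the probability clipping is an addition for numerical safety):

    import numpy as np

    def total_loss(p, y, affine_weights, lam):
        """Binary cross-entropy over all K^2 target points plus L2 penalty."""
        p = np.clip(p, 1e-7, 1 - 1e-7)
        bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        l2 = sum(np.sum(W ** 2) for W in affine_weights)
        return bce + lam * l2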

Results from Vehicle Point Detection using the CNN [patch size = 64 × 64, target size = 8 × 8]. [Figures: predictions on the nyc_0 (in-sample) and nyc_1 test pieces. Legend: true positives yellow, true negatives dark blue, false positives cyan, false negatives orange.]

nyc_0 (in-sample) test: recall 0.85, precision 0.73.

nyc_1 test: recall 0.85, precision 0.73.

Experimental Results: Input Feature Map Comparison. [Table: performance by input feature map combination. Key: D = depth, H = height, A = angle, S = signed angle, M = missing mask.]

Impact of Using Missing Point Labels Training with missing point labels improves precision. Missing point labels allow for complete segmentation. [Figures: DHASM with missing point labels vs. DHASM with no missing point labels.]

Experimental Results: Use of Missing Point Labels. [Table: results with and without missing point labels. Key: NML = no missing labels.]

Conclusions for CNN-Based Segmentation A CNN for LIDAR learned using a sampling-based training pipeline. We can predict class labels over missing points in LIDAR. Incorporating missing points improves precision. Input feature maps that describe 3D shape and sensor properties have a significant effect on performance.

Proposed Work Extend the CNN model to multiclass object localization, segmentation, classification, and pose estimation in 3D images. Examine the design and structure of CNN components for 3D images: depth-sensitive localization; depth-sensitive subpixel methods for segmentation; spatial transformers for pose estimation. Utilize domain adaptation from synthetic data for auxiliary training data and missing point reconstruction.

Novelty of Proposed Work A multi-task model covering all tasks; previous models address at most three of the proposed tasks. Addition of 3D object pose estimation. Improved performance on all tasks by adapting current state-of-the-art techniques to the domain of 3D objects. Balance between 2.5D image and 3D voxel representations. Incorporation of additional datasets. Comparison across urban LIDAR and indoor RGB-D domains. Missing point estimation from synthetic data or multi-view reconstruction. Domain adaptation from synthetic datasets.

2D Object Localization in LIDAR (In Progress) Preliminary results at a 0.8 confidence threshold. Based on the YOLO single-shot architecture. Can be used for region proposals or extended to 3D bounding boxes.

Google Street View Dataset: Ground Truth Pose Labeling Automatic fit of bounding boxes, using PCA to fit non-axis-aligned boxes (see the sketch below). Manual tool to (a) select the front face (shown in a different color) for orientation, with the default selected automatically, and (b) change the size/position/orientation of boxes in case of incomplete objects.
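A sketch of the PCA box fit; constraining the rotation to the gravity axis for upright vehicles would be a small extension and is left out:

    import numpy as np

    def pca_oriented_bbox(points):
        """Fit a non-axis-aligned box: returns center, extents, axis rows."""
        c = points.mean(axis=0)
        _, _, axes = np.linalg.svd(points - c)  # rows are principal directions
        local = (points - c) @ axes.T           # coordinates in the PCA frame
        lo, hi = local.min(axis=0), local.max(axis=0)
        center = c + ((lo + hi) / 2) @ axes     # box center back in world frame
        return center, hi - lo, axes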

Multi-task Model for Object Identification A shared representation can be applicable to multiple tasks. Tasks: object localization, segmentation, classification, and pose estimation. The error signal for each task trains the weights of the shared representation. Source: Dai et al., Instance-aware Semantic Segmentation via Multi-task Network Cascades

Multi-task Model for Object Identification Straightforward extension to orientation estimation. Assume objects are upright and estimate rotation about the gravity axis. Source: Dai et al., Instance-aware Semantic Segmentation via Multi-task Network Cascades

Localization for 3D Objects in Voxel Space 3D voxel input representation (TSDF). The voxel gives relative position; the anchor box gives a shape prior. The network estimates adjustments for box position and dimensions. Source: Song and Xiao, Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images

Depth-Sensitive Localization We aim to maintain a non-volumetric 2.5D input representation. Partition the viewing volume and consider localization in depth slices z1, …, z4. A 2D convolution over the (W, H, F_in) input predicts, for each grid cell (x_i, y_i, z_i) and anchor of dimensions (a_x, a_y, a_z), an adjusted 3D box in an (X, Y, Z, A × 6) output volume:

b_x = x_i + dx,  b_y = y_i + dy,  b_z = z_i + dz
b_width = a_x · s_x,  b_height = a_y · s_y,  b_depth = a_z · s_z
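A sketch decoding one box under this parameterization; the argument names for the six regression outputs follow the slide:

    import numpy as np

    def decode_box(cell_center, anchor_dims, pred):
        """Turn an anchor at grid cell (x_i, y_i, z_i) plus the network's
        (dx, dy, dz, sx, sy, sz) outputs into a 3D box (center, size)."""
        dx, dy, dz, sx, sy, sz = pred
        center = np.asarray(cell_center) + np.array([dx, dy, dz])
        size = np.asarray(anchor_dims) * np.array([sx, sy, sz])
        return center, size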

Subpixel Convolutions Pooled CNN features can still encode higher-resolution information. Upscale back through deconvolution or subpixel convolution. Used in state-of-the-art segmentation networks. [Diagram: padded image, zero-padded sub-pixel image, subpixel filter, filter activations.] Source: Shi et al., Is the deconvolution layer the same as a convolutional layer?

Subpixel Convolutions Independent subpixel filter weights can be separated. All convolutions run at low resolution; the results are interleaved to upsample at the end of the network. [Diagram: padded image, separate filters, filter activations, combined filter activations.] Source: Shi et al., Is the deconvolution layer the same as a convolutional layer?
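A NumPy sketch of the final interleaving (periodic shuffle) that rearranges r² low-resolution channel groups into an r-times upscaled map:

    import numpy as np

    def pixel_shuffle(x, r):
        """Rearrange (h, w, c) activations into (h*r, w*r, c // r**2)."""
        h, w, c = x.shape
        assert c % (r * r) == 0, "channels must be divisible by r^2"
        out_c = c // (r * r)
        x = x.reshape(h, w, r, r, out_c)
        x = x.transpose(0, 2, 1, 3, 4)  # (h, r, w, r, out_c)
        return x.reshape(h * r, w * r, out_c)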

Position-sensitive Score Maps Subpixel-like features can be specialized for a given task. Source: Dai et al., R-FCN: Object Detection via Region-based Fully Convolutional Networks

Depth-sensitive Score Maps We can extend this approach to be depth-sensitive: a convolution produces k³(C + 1) score maps, one per class for each of the k × k × k position bins (top-left-back, top-left-center, …, bottom-right-center, bottom-right-front). Pooling over a region followed by voting yields the C + 1 class scores.

Spatial Transformers for Pose Estimation A general method for parameterized transforms between feature maps. Interpolation over a transformed sampling grid. The estimated transformation is related to the 3D object pose.
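A minimal single-channel NumPy sketch of the sampler: an affine theta (2 × 3) warps a normalized grid over the feature map, followed by bilinear interpolation:

    import numpy as np

    def affine_sample(feat, theta, out_h, out_w):
        """Warp a sampling grid by theta and bilinearly interpolate feat."""
        h, w = feat.shape
        ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                             np.linspace(-1, 1, out_w), indexing="ij")
        grid = np.stack([xs, ys, np.ones_like(xs)], axis=-1) @ theta.T
        gx = np.clip((grid[..., 0] + 1) * (w - 1) / 2, 0, w - 1)
        gy = np.clip((grid[..., 1] + 1) * (h - 1) / 2, 0, h - 1)
        x0 = np.clip(np.floor(gx).astype(int), 0, w - 2)
        y0 = np.clip(np.floor(gy).astype(int), 0, h - 2)
        wx, wy = gx - x0, gy - y0
        return ((1 - wy) * (1 - wx) * feat[y0, x0]
                + (1 - wy) * wx * feat[y0, x0 + 1]
                + wy * (1 - wx) * feat[y0 + 1, x0]
                + wy * wx * feat[y0 + 1, x0 + 1])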

Complete Model Sketch [Diagram: 2.5D image → conv → feature maps → down convs → shared feature maps → multi-scale depth-sensitive localization → ROI pooling and spatial transformer → depth-sensitive segmentation, classification, and pose estimation.]

Timeline for Completion
December 2016: Select and prepare new datasets for experiments. Annotate the Street View dataset with object bounding boxes. Extend current localization and segmentation implementations for baselines. Begin implementation of classification and pose estimation baselines.
January 2017: Complete implementation of baseline models and begin training models for evaluation on a chosen dataset. Implement a baseline multi-task model.

Timeline for Completion
February 2017: Begin experiments with architectures using depth-sensitive localization, depth-sensitive subpixel convolution for segmentation, and 3D object pose estimation with spatial transformers.
March 2017: Prepare paper for ICCV 2017 submission, including experiments on multi-task learning for 3D object identification and one of the proposed depth-sensitive experimental architectures. Consider additional experiments on domain adaptation and missing point reconstruction.

Timeline for Completion
April 2017: Dissertation writing. Continuation of experiments.
May 2017: Dissertation defense. Prepare paper submission to 3DV 2017 containing additional experiments.

Additional Slides

Google Street View Dataset (Google R5) All but two pieces of NYC 0 are used for training. The remaining runs are used for evaluation.

KITTI Dataset 3D bounding boxes for vehicles, cyclists, and pedestrians in LIDAR. Precise segmentation labels not included in benchmark.

Synthia Dataset Synthetic urban scenes for simulated RGB-D scans. Exact labels for semantic segmentation, but 3D poses are not given. Domain adaptation is required for effective use on real-world data. The missing point reconstruction task can be simulated.

Indoor RGB-D Datasets SUN RGB-D and SceneNN. Class, segmentation, and oriented 3D bounding boxes included. Reconstructed shape can be used for missing points.

Assumptions for Proposed Work A single 3D image from a LIDAR sensor sweep or RGB-D camera. Excludes video, multiview registration, and volumetric sensors. Shape completion is considered only for missing (non-occluded) scan points. Excludes complete volumetric shape reconstruction and database matching. Hua et al., SceneNN: A Scene Meshes Dataset with Annotations. Wu et al., 3D ShapeNets: A Deep Representation for Volumetric Shapes.