Multi-View 3D Object Detection Network for Autonomous Driving

Multi-View 3D Object Detection Network for Autonomous Driving
Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, Tian Xia. CVPR 2017 (Spotlight).
Presented by: Jason Ku

Overview
- Motivation
- Dataset
- Network Architecture
- Multi-View Inputs
- 3D Proposal Network
- Multi-View ROI Pooling
- Fusion Network
- Network Regularization
- Training
- Results
- Summary
- Improvements

Motivation
- 3D detections are much more useful for autonomous driving than 2D detections
- LIDAR data contains accurate depth information
- A well-designed model is required to take advantage of multiple views
- Two-stage architectures provide higher-quality bounding boxes

Dataset
- KITTI images, car instances only
- The training images are split in half into training and validation sets (~3700 images each)
- The KITTI test server only evaluates 2D detections
- 3D detections are evaluated on the validation set

Network Architecture
- VGG-16 base network for each view, with the number of channels reduced by half
- The 4th max pooling operation is removed
- Extra fully connected layer fc8
- Weights initialized by sampling weights from VGG-16
- Even with 3 branches, the network has only ~75% of the parameters of the full VGG-16

Multi-View Inputs
- Bird's Eye View (BV)
- Front View (FV)
- Camera Image (RGB), upscaled so that its shortest side is 500 pixels

Bird's-Eye View (BV)
- Discretized LIDAR point cloud with 0.1 m resolution
- ~90° FOV images
- Range of [0, 70.4] m (depth) x [-40, 40] m, i.e. 704 x 800 pixels
- Features in (M + 2) channels:
  - M height maps: the point cloud is divided into M equal slices, and each cell stores the maximum height of its points (likely to handle tunnels, bridges, or trees)
  - Density: the number of points in each cell, normalized as min(1.0, log(N + 1) / log(64)), where N is the number of points in the cell
  - Intensity: the LIDAR reflectance of the point with maximum height in the cell
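
A minimal NumPy sketch of the BEV encoding described above; the z-range, slice bounds, and the helper name lidar_to_bev are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def lidar_to_bev(points, intensities, M=4,
                 x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                 z_range=(-2.0, 1.2), res=0.1):
    """Build (M + 2)-channel BEV maps: M height slices, density, intensity."""
    H = int((x_range[1] - x_range[0]) / res)   # 704 rows (depth)
    W = int((y_range[1] - y_range[0]) / res)   # 800 cols (lateral)
    height_maps = np.zeros((M, H, W), dtype=np.float32)
    density = np.zeros((H, W), dtype=np.float32)
    intensity = np.zeros((H, W), dtype=np.float32)
    top_z = np.full((H, W), -np.inf)           # highest point seen per cell

    # keep only points inside the cropped range
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts, refl = points[mask], intensities[mask]

    rows = ((pts[:, 0] - x_range[0]) / res).astype(int)
    cols = ((pts[:, 1] - y_range[0]) / res).astype(int)
    slice_h = (z_range[1] - z_range[0]) / M
    slices = np.clip(((pts[:, 2] - z_range[0]) / slice_h).astype(int), 0, M - 1)

    for r, c, s, z, i in zip(rows, cols, slices, pts[:, 2], refl):
        # max height per slice, measured from the bottom of the crop
        height_maps[s, r, c] = max(height_maps[s, r, c], z - z_range[0])
        density[r, c] += 1.0
        if z > top_z[r, c]:                    # intensity of the highest point
            top_z[r, c], intensity[r, c] = z, i

    density = np.minimum(1.0, np.log(density + 1.0) / np.log(64.0))
    return np.concatenate([height_maps, density[None], intensity[None]], axis=0)
```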

Front View (FV)
- Projects the LIDAR point cloud onto a cylinder plane, giving a denser map than projection onto a 2D point map
- Channels: height, distance, intensity
- 64-beam Velodyne => 64 x 512 pixels
- Cylinder plane conversion: given a 3D point p = (x, y, z), its front view coordinates p_fv = (r, c) can be computed as c = ⌊atan2(y, x) / Δθ⌋ and r = ⌊atan2(z, sqrt(x² + y²)) / Δφ⌋, where Δθ and Δφ are the horizontal and vertical resolution of the laser beams
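
The projection above is straightforward to sketch; the resolution values below are illustrative, not taken from the paper.

```python
import numpy as np

def lidar_to_front_view(points, delta_theta=np.deg2rad(0.2),
                        delta_phi=np.deg2rad(0.4)):
    """Map 3D LIDAR points to (row, col) cylinder-plane coordinates.
    delta_theta / delta_phi are the horizontal / vertical beam resolutions."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    cols = np.floor(np.arctan2(y, x) / delta_theta).astype(int)
    rows = np.floor(np.arctan2(z, np.sqrt(x ** 2 + y ** 2)) / delta_phi).astype(int)
    return rows, cols
```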

3D Proposal Network
- Bird's eye view input:
  - Preserves physical size
  - Objects occupy different space, avoiding occlusion
  - Provides better predictions since objects are grounded, and encodes depth information
- Feature upsampling for small objects:
  - 2x bilinear feature map upsampling
  - The proposal network therefore gets a 4x downsampled input (176 x 200 px)

3D Proposal Network
- 3D anchors:
  - 3D prior boxes created by clustering ground truth object sizes
  - Represented with a center and sizes; (l, w) ∈ {(3.9, 1.6), (1.0, 0.6)} m, h = 1.56 m
  - Orientations of {0°, 90°}, not regressed; these are close to the orientations of most road scene objects
  - 4 anchor boxes per location
- Proposal filtering:
  - Remove background and empty proposals
  - NMS at 0.7 IoU in BV, based on objectness score
  - Top 2000 proposals kept for training, top 300 for testing
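
The anchor scheme above could be tiled over the BEV feature map as sketched below; the feature-map stride, nominal ground height, and the function name generate_bev_anchors are assumptions, not part of the paper.

```python
import numpy as np

def generate_bev_anchors(depth_bins=176, lateral_bins=200, stride=0.4,
                         x_range=(0.0, 70.4), y_range=(-40.0, 40.0)):
    """Tile the 4 anchors (2 sizes x 2 orientations) over the BEV feature map."""
    sizes = [(3.9, 1.6), (1.0, 0.6)]     # (length, width) in metres, from clustering
    yaws = [0.0, np.pi / 2]              # 0 and 90 degrees, not regressed
    h = 1.56                             # fixed anchor height in metres

    anchors = []
    for i in range(depth_bins):
        for j in range(lateral_bins):
            cx = x_range[0] + (i + 0.5) * stride
            cy = y_range[0] + (j + 0.5) * stride
            for l, w in sizes:
                for yaw in yaws:
                    # z = -1.0 is an assumed nominal ground height
                    anchors.append((cx, cy, -1.0, l, w, h, yaw))
    return np.array(anchors, dtype=np.float32)
```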

3D Proposal Bounding Box Regression
- Parameterized as t = (Δx, Δy, Δz, Δl, Δw, Δh)
- (Δx, Δy, Δz) are the center offsets, normalized by the anchor size
- (Δl, Δw, Δh) are computed as Δs = log(s_GT / s_anchor) for s ∈ {l, w, h}
- Multi-task loss:
  - Cross-entropy (log loss) for objectness
  - Smooth L1 (distance) loss for 3D box regression
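
A sketch of how these regression targets could be computed for one anchor / ground-truth pair; which anchor dimension normalizes each center offset is an assumption here.

```python
import numpy as np

def encode_box_targets(anchor, gt):
    """Regression targets t = (dx, dy, dz, dl, dw, dh) for one anchor and one
    ground-truth box, both given as (cx, cy, cz, l, w, h)."""
    ax, ay, az, al, aw, ah = anchor
    gx, gy, gz, gl, gw, gh = gt
    # centre offsets normalized by the anchor size
    dx, dy, dz = (gx - ax) / al, (gy - ay) / aw, (gz - az) / ah
    # log-ratio size offsets
    dl, dw, dh = np.log(gl / al), np.log(gw / aw), np.log(gh / ah)
    return np.array([dx, dy, dz, dl, dw, dh], dtype=np.float32)
```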

Multi-View ROI Pooling
- 3D box proposals are projected into each view
- Features are taken from 4x/4x/2x upsampled feature maps
- Region of Interest (ROI) pooling produces fixed-length feature vectors
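
One possible way to obtain same-size features from each view once the 3D proposals have been projected to per-view 2D ROIs, using torchvision's ROI pooling; the 7x7 output size is an assumption.

```python
import torch
from torchvision.ops import roi_pool

def multiview_roi_pool(bev_feat, fv_feat, rgb_feat, rois_bev, rois_fv, rois_rgb):
    """Pool each view's feature map with that view's projected 2D ROIs.
    rois_* are (batch_idx, x1, y1, x2, y2) in the coordinates of each map."""
    pooled_bev = roi_pool(bev_feat, rois_bev, output_size=(7, 7), spatial_scale=1.0)
    pooled_fv  = roi_pool(fv_feat,  rois_fv,  output_size=(7, 7), spatial_scale=1.0)
    pooled_rgb = roi_pool(rgb_feat, rois_rgb, output_size=(7, 7), spatial_scale=1.0)
    return pooled_bev, pooled_fv, pooled_rgb
```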

Fusion Network
- Combines information from the pooled feature vectors of the different views

Early and Late Fusion
- Early fusion: features are combined at the input stage. For a network with L layers,
  f_L = H_L(H_{L-1}(... H_1(f_BV ⊕ f_FV ⊕ f_RGB))),
  where the H_l are feature transformation functions and ⊕ is a join operation (e.g. concatenation)
- Late fusion: separate subnetworks learn features independently, and their outputs are combined at the prediction stage

Deep Fusion
- Element-wise mean for the join operation
- More interaction among the per-view features
- More flexible when combined with drop-path training
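
A minimal PyTorch sketch of one deep-fusion stage as described above (per-view transformations joined by an element-wise mean); the layer sizes and class name are assumptions.

```python
import torch
import torch.nn as nn

class DeepFusionBlock(nn.Module):
    """One deep-fusion stage: each view applies its own transformation to the
    shared fused feature, and the outputs are joined by an element-wise mean."""
    def __init__(self, channels=512):
        super().__init__()
        self.branch_bv = nn.Sequential(nn.Linear(channels, channels), nn.ReLU())
        self.branch_fv = nn.Sequential(nn.Linear(channels, channels), nn.ReLU())
        self.branch_rgb = nn.Sequential(nn.Linear(channels, channels), nn.ReLU())

    def forward(self, fused):
        # transform per view, then join by mean
        out = torch.stack([self.branch_bv(fused),
                           self.branch_fv(fused),
                           self.branch_rgb(fused)], dim=0)
        return out.mean(dim=0)
```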

Oriented 3D Box Regression
- Uses the fusion features of the multi-view network
- 8-corner box representation
- Multi-task loss, with regression targets normalized by the diagonal length of the proposal box
- The 24-D corner vector is redundant, but works better than the centers-and-sizes approach
- Orientation is computed from the 3D box corners
- Cross-entropy loss for classification, smooth L1 loss for the 3D bounding box
- During inference, NMS is applied to the 3D boxes in BV with an IoU threshold of 0.05, so that essentially no overlapping boxes remain
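
A sketch of the 8-corner (24-value) representation and of recovering the yaw angle from the corners; the corner ordering is an assumption.

```python
import numpy as np

def box_to_corners(cx, cy, cz, l, w, h, yaw):
    """Return the 8 corners of an oriented 3D box as an (8, 3) array; flattening
    it gives the redundant 24-D regression target."""
    dx, dy = l / 2.0, w / 2.0
    base = np.array([[ dx,  dy], [ dx, -dy], [-dx, -dy], [-dx,  dy]])
    rot = np.array([[np.cos(yaw), -np.sin(yaw)],
                    [np.sin(yaw),  np.cos(yaw)]])
    xy = base @ rot.T + np.array([cx, cy])
    bottom = np.hstack([xy, np.full((4, 1), cz)])
    top = np.hstack([xy, np.full((4, 1), cz + h)])
    return np.vstack([bottom, top])

def yaw_from_corners(corners):
    """Recover yaw as the direction of one of the box's long ground-plane edges."""
    edge = corners[0] - corners[3]
    return np.arctan2(edge[1], edge[0])
```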

Network Regularization - Drop-Path Training
- Randomly choose global or local drop-path with 50% probability
- Global: select a single view from the 3 views with equal probability
- Local: each path input to a join node is dropped with 50% probability, but at least 1 path is kept
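
The sampling rule above might look like this in code (a sketch; the function name is made up).

```python
import random

def sample_drop_path_masks(num_views=3, p_global=0.5, p_drop=0.5):
    """Sample which view branches feed a join node during training.
    Global drop-path keeps exactly one view; local drop-path drops each
    path independently but always keeps at least one."""
    if random.random() < p_global:
        keep = random.randrange(num_views)            # global: one random view
        return [i == keep for i in range(num_views)]
    mask = [random.random() >= p_drop for _ in range(num_views)]
    if not any(mask):                                  # local: guarantee a survivor
        mask[random.randrange(num_views)] = True
    return mask
```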

Network Regularization - Auxiliary Losses
- Additional paths and losses, with the same number of layers as the main network
- Parameters are shared with the main network
- Strengthens the representation capability of each view
- All losses are weighted equally
- Not used during inference

Training
- Trained end-to-end on the training split (~3700 images)
- Trained with SGD: learning rate of 0.001 for 100K iterations, then 0.0001 for another 20K iterations
- 3D detections are evaluated on the validation set only

Results - 3D Proposal Recall
- Recall vs IoU using 300 proposals, on the moderate KITTI data
- With 300 proposals: 99.1% recall at 0.25 IoU, 91% recall at 0.5 IoU
- Plots shown: recall vs number of proposals at 0.25 IoU and at 0.5 IoU

Results - 3D Detections
- Inference time for one image: 0.36 s on a Titan X GPU

Results - Feature Fusion, Multi-View Features

Results - 2D Detections

Qualitative Results
- Top left: 3DOP; top right: VeloFCN; right: Multi-View

Summary
- Multi-view input representations
- Region-based deep fusion network
- Improves over LIDAR-based and image-based methods
- Outperforms other methods by ~25% and ~30% AP for 3D detections
- 2D detections are also competitive

Shortcomings / Improvements
- LIDAR vs stereo data
- Inference time of 0.36 s is almost fast enough
- Relies on pre-processed input representations
- Code not released
- 3D detections only tested on the validation set
- No detections for pedestrians or cyclists; the points may be too sparse, and data augmentation would be required to get more instances from KITTI

Questions?