Team MIT-Princeton @ the Amazon Robotics Challenge: 1st place in stowing task


Grasping

Team MIT-Princeton @ the Amazon Robotics Challenge: 1st place in stowing task. Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois Hogan, Maria Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, Nima Fazeli, Ferran Alet, Nikhil Dafle, Rachel Holladay, Isabella Morona, Prem Qu Nair, Druck Green, Ian Taylor, Weber Liu, Thomas Funkhouser, Alberto Rodriguez

From model-based to model-free. Model-based grasping: pose estimation, then grasp planning. Works well with known objects in structured environments, but can't handle novel objects in unstructured environments (pose estimation fails). Model-free grasping: visual data straight to grasp planning. Uses local geometric features and ignores object identity; end-to-end; motivated by industry.

Recent work on model-free grasping: Grasp Pose Detection (M. Gualtieri et al., '17), Supersizing Self-Supervision (L. Pinto and A. Gupta, '16), DexNet 1.0-3.0 (J. Mahler et al., '17). These handle clutter in tabletop scenarios and novel objects selected beforehand. Common limitations: low grasp sample density, small neural network sizes.

In this talk: a model-free grasping method that

- handles dense clutter in a tabletop bin/box scenario. Rethinking dense clutter: objects are not only tightly packed, but also tossed and stacked on top of each other, and sit in corners and on bin edges;
- works for novel objects of all kinds (i.e., any household object should be fair game): 90-95% grasping accuracy is not enough, and some objects come without depth data;
- is fast and efficient: instead of the standard grasp sampling, it does grasp ranking over dense pixel-wise predictions;
- took 1st place in the stowing task at the Amazon Robotics Challenge '17 (i.e., it works). [Competition footage of "The Beast from the East" setup.]

Overview: multi-affordance grasping. Input: multi-view RGB-D images. Output: dense grasp proposals and affordance scores for 4 primitive grasping behaviors: suction down, suction side, grasp down, and flush grasp.
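A minimal sketch (hypothetical names and shapes, not the authors' code) of the output side of this overview: given one dense affordance map per primitive, the highest-scoring (primitive, pixel) pair determines the action to execute.

```python
import numpy as np

# the four primitive grasping behaviors from the slide
PRIMITIVES = ["suction-down", "suction-side", "grasp-down", "flush-grasp"]

def best_action(affordance_maps):
    """Pick the highest-scoring (primitive, pixel) pair.

    affordance_maps: array of shape (4, H, W), one dense score map
    per grasping primitive, values in [0, 1].
    Returns (primitive_name, (row, col), score).
    """
    idx = np.unravel_index(np.argmax(affordance_maps), affordance_maps.shape)
    p, r, c = idx
    return PRIMITIVES[p], (r, c), float(affordance_maps[p, r, c])

# toy example: 4 primitives over a 4x4 image
maps = np.zeros((4, 4, 4))
maps[2, 1, 3] = 0.9  # grasp-down scores highest at pixel (1, 3)
name, pixel, score = best_action(maps)
print(name, pixel, score)  # grasp-down wins at pixel (1, 3)
```

Because the scores are dense, this ranking step is a single argmax over all primitives and pixels, with no separate candidate-sampling stage.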

Dense pixel-wise affordances with FCNs. Input RGB-D images pass through a fully convolutional ResNet-50 that predicts dense suction-down and suction-side affordances. What about grasping? RGB-D heightmaps pass through a fully convolutional ResNet-50 that predicts horizontal grasp affordances for the grasp-down and flush-grasp primitives.
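Since the network only predicts horizontal grasps, other grasp angles can be covered by rotating the heightmap before scoring it. A dependency-free sketch of that idea, with a dummy scorer standing in for the FCN and 90-degree steps via np.rot90 (both assumptions; a real model would use finer rotations and learned scores):

```python
import numpy as np

def horizontal_grasp_scores(heightmap):
    """Stand-in for the FCN: here, score = height, so taller pixels
    win. A real fully convolutional network would predict these."""
    return heightmap

def best_grasp(heightmap, n_angles=4):
    """Score a horizontal-grasp predictor on n_angles rotations of
    the heightmap; the winning rotation fixes the grasp angle."""
    best = (-np.inf, None, None)  # (score, angle_deg, pixel in rotated frame)
    for k in range(n_angles):
        rot = np.rot90(heightmap, k)  # rotate by k * 90 degrees
        scores = horizontal_grasp_scores(rot)
        r, c = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[r, c] > best[0]:
            best = (float(scores[r, c]), 90 * k, (r, c))
    return best

hm = np.array([[0.0, 0.1],
               [0.5, 0.2]])
score, angle, pixel = best_grasp(hm)
print(score, angle, pixel)
```

With a rotation-invariant dummy scorer the 0-degree rotation wins; with a learned horizontal-grasp predictor, different rotations produce genuinely different score maps, which is what makes the angle search meaningful.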

Training data: manually labeled suctionable areas and parallel-jaw grasps on ~100 different household/office objects.

Generalization from hardware capabilities: high-powered deployable suction and an actuated spatula.

Pros and cons. Advantages: fast runtime speeds from efficient convolution; uses both color and depth information; can leverage large pre-trained networks; higher good-grasp recall (grasp ranking vs. the standard grasp sampling). Limitations: considers only top-down parallel-jaw grasps (though it can trivially extend to more grasp angles); limited to grasping behaviors for which affordances can be defined (no real planning); open-loop.

Future work. How can we improve model-free grasping (visual data, grasp planning) by making it more like model-based grasping (pose estimation, grasp planning)? For example, Semantic Scene Completion from a Single Depth Image [Song et al., CVPR '17].

Takeaways: a model-free grasping method using FCNs to compute dense affordance predictions for multiple grasping behaviors (suction, parallel-jaw). Multiple grasping primitive behaviors handle dense clutter in a bin/box scenario; multi-view color and depth, diverse training data, and robust hardware handle novel objects of all kinds; FCNs for grasping affordance predictions give efficiency and high grasp recall. Paper and code are available: arc.cs.princeton.edu

Recognition of novel objects without retraining: match real images of novel objects to their product images (available at test time). After isolating an object from clutter with model-free grasping, perform recognition.

Cross-domain image matching (training): product images and observed images are embedded so that an ℓ2 distance ratio loss decides whether they match; a softmax loss is added for K-Net only.
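One common form of a distance ratio loss, sketched here as an assumption (a softmax over the two ℓ2 distances in a triplet; the exact formulation in the paper may differ): matching pairs should end up relatively closer in the embedding than non-matching pairs.

```python
import numpy as np

def distance_ratio_loss(anchor, positive, negative):
    """Softmax-over-distances triplet loss: drive the anchor's
    distance ratio to the positive toward 0 and to the negative
    toward 1. Inputs are embedding vectors; returns a scalar."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    # softmax over the two distances -> ratio terms in (0, 1)
    e_pos, e_neg = np.exp(d_pos), np.exp(d_neg)
    r_pos = e_pos / (e_pos + e_neg)  # small when the positive is closer
    r_neg = e_neg / (e_pos + e_neg)  # large when the negative is farther
    return r_pos ** 2 + (r_neg - 1.0) ** 2

a = np.array([1.0, 0.0])
good = distance_ratio_loss(a, np.array([1.0, 0.1]), np.array([-1.0, 0.0]))
bad = distance_ratio_loss(a, np.array([-1.0, 0.0]), np.array([1.0, 0.1]))
print(good < bad)  # a well-separated triplet gives the lower loss
```

Minimizing a loss of this shape shapes the embedding so that observed images land near their matching product images, which is exactly what the test-time matching step relies on.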

Cross-domain image matching (testing): the input image is mapped into the feature embedding alongside product images of both known and novel objects, and the nearest embedding gives the match. Pre-trained ImageNet features.
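The test-time step reduces to a nearest-neighbor lookup in the learned embedding; a minimal sketch (hypothetical feature vectors, names of my choosing):

```python
import numpy as np

def match_product(observed_feat, product_feats):
    """Nearest-neighbor lookup: return the index of the product-image
    embedding closest (in l2) to the observed-image embedding.
    Novel objects need no retraining; only their product images
    must be embedded at test time."""
    dists = np.linalg.norm(product_feats - observed_feat, axis=1)
    return int(np.argmin(dists))

# toy 2-D embeddings for three product images
products = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.7, 0.7]])
obs = np.array([0.1, 0.9])  # closest to product index 1
print(match_product(obs, products))
```

Because matching is distance-based rather than classifier-based, adding a new product only means adding one more row to the product embedding matrix.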