(Deep) Learning for Robot Perception and Navigation. Wolfram Burgard

Deep Learning for Robot Perception (and Navigation). Liefeng Bo, Claas Bollen, Thomas Brox, Andreas Eitel, Dieter Fox, Gabriel L. Oliveira, Luciano Spinello, Jost Tobias Springenberg, Martin Riedmiller, Michael Ruhnke, Abhinav Valada

Perception in Robotics. Robot perception is a challenging problem and involves many different aspects, such as scene understanding, object detection, and detection of humans. Goal: improve perception in robotics scenarios using state-of-the-art deep learning methods.

Why Deep Learning? Multiple layers of abstraction provide an advantage for solving complex pattern recognition problems. Deep learning has been successful in computer vision for detection, recognition, and segmentation problems, and one set of techniques can serve different fields and be applied to solve a wide range of problems.

What Our Robots Should Do. RGB-D object recognition from images; human part segmentation in images; terrain classification from sound (e.g., asphalt, mowed grass, grass).

Multimodal Deep Learning for Robust RGB-D Object Recognition Andreas Eitel, Jost Tobias Springenberg, Martin Riedmiller, Wolfram Burgard [IROS 2015]

RGB-D Object Recognition

RGB-D Object Recognition. Prior approaches combine learned features with a separate classifier trained on top: sparse coding networks [Bo et al., 2012], convolutional-recursive neural networks [Socher et al., 2012], and deep CNN features [Schwarz et al., 2015]. This work instead uses end-to-end deep learning with an RGB-D CNN.

Often Too Little Data for Deep Learning Solutions. Deep networks are hard to train and require large amounts of data, yet large labeled training sets are lacking in the RGB-D domain. How to deal with the limited sizes of available datasets?

Data Often Too Clean for Deep Learning Solutions. A large portion of RGB-D data is recorded under controlled settings. How to improve recognition in real-world scenes when the training data is clean? How to deal with sensor noise from RGB-D sensors?

Solution: Transfer Deep RGB Features to the Depth Domain. Both domains share similar features, such as edges, corners, and curves.

Solution: Transfer Deep RGB Features to the Depth Domain. A pre-trained RGB CNN is transferred* from the RGB domain to the depth domain via a depth encoding; the network features are then fine-tuned and re-trained for depth. *Similar to [Schwarz et al., 2015; Gupta et al., 2014]
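The transfer-then-fine-tune idea above can be sketched as follows; the layer names, weight shapes, and the dummy update rule are illustrative stand-ins, not the actual network:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical pre-trained RGB CNN, represented here as a dict of weight arrays.
rgb_pretrained = {f"conv{i}": rng.standard_normal((3, 3)) for i in range(1, 6)}

# Transfer: the depth network is initialized with a copy of the RGB weights.
depth_net = {name: w.copy() for name, w in rgb_pretrained.items()}

# Fine-tune: apply (dummy) gradient steps to the later layers only; the early
# layers keep the transferred edge/corner/curve features fixed.
trainable = {"conv4", "conv5"}
for name in trainable:
    depth_net[name] -= 0.1 * np.sign(depth_net[name])  # stand-in for SGD
```

After this loop, the early depth-stream layers are identical to the RGB-pretrained ones, while the later layers have adapted to the (encoded) depth inputs.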

Multimodal Deep Convolutional Neural Network. Two input modalities in a late-fusion network (2x AlexNet + fusion net): 10 convolutional layers, max-pooling layers, 4 fully connected layers, and a softmax classifier.
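As a rough illustration of the late-fusion head: each stream produces a feature vector, the vectors are concatenated, and fusion layers plus a softmax produce class posteriors. The feature sizes, the single fusion layer, and all weights below are toy stand-ins for the actual fully connected layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion_forward(f_rgb, f_depth, W_fus, b_fus, W_cls, b_cls):
    """Concatenate stream features, apply one fusion layer, then classify."""
    fused = np.concatenate([f_rgb, f_depth], axis=-1)  # late fusion of streams
    h = np.maximum(0.0, fused @ W_fus + b_fus)         # ReLU fusion layer
    return softmax(h @ W_cls + b_cls)                  # class posteriors

# Toy dimensions: 4096-d features per stream, 51 object categories (as in UW RGB-D).
f_rgb = rng.standard_normal(4096)
f_depth = rng.standard_normal(4096)
W_fus = rng.standard_normal((8192, 256)) * 0.01
W_cls = rng.standard_normal((256, 51)) * 0.01
probs = late_fusion_forward(f_rgb, f_depth, W_fus, np.zeros(256), W_cls, np.zeros(51))
```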

How to Encode Depth Images? Distribute depth over the color channels: compute the min and max values of the depth map, shift the depth map to the min/max range, normalize the depth values to lie between 0 and 255, and colorize the image using the jet colormap (red = near, blue = far). This depth encoding improves recognition accuracy by 1.8 percentage points. (Figure: RGB, raw depth, colorized depth.)
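The encoding steps can be sketched as follows; the piecewise-linear colormap here is a simplified stand-in for the exact jet colormap:

```python
import numpy as np

def colorize_depth(depth):
    """Encode a raw depth map as a 3-channel image using a simplified
    jet-style colormap (red = near, blue = far, as on the slide).
    Values are first normalized using the map's own min/max."""
    d = depth.astype(np.float64)
    d = (d - d.min()) / max(d.max() - d.min(), 1e-12)  # normalize to [0, 1]
    d = 1.0 - d                                        # so red = near, blue = far
    # Piecewise-linear approximation of the jet colormap.
    r = np.clip(1.5 - np.abs(4.0 * d - 3.0), 0.0, 1.0)
    g = np.clip(1.5 - np.abs(4.0 * d - 2.0), 0.0, 1.0)
    b = np.clip(1.5 - np.abs(4.0 * d - 1.0), 0.0, 1.0)
    return (np.stack([r, g, b], axis=-1) * 255).astype(np.uint8)

img = colorize_depth(np.array([[0.5, 1.0], [2.0, 4.0]]))  # toy depths in metres
```

The nearest pixel (0.5 m) comes out reddish and the farthest (4.0 m) bluish, matching the slide's convention.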

Solution: Noise-aware Depth Feature Learning. Pipeline: clean training data plus noise samples, noise adaptation, then classification.

Training with Noise Samples. 50,000 noise samples; noise is randomly sampled for each training batch, and the noise samples are shuffled. (Figure: training batch and input image.)
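A minimal sketch of this augmentation loop. The pool size and blending weight are illustrative (the talk uses 50,000 noise samples), and how noise is composited into an image is an assumption here:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in pool of pre-extracted sensor-noise samples (50,000 in the talk).
noise_pool = rng.standard_normal((500, 32, 32))

def noisy_batch(batch, noise_pool, rng, alpha=0.3):
    """Blend a randomly drawn noise sample into each image of a training batch."""
    idx = rng.integers(0, len(noise_pool), size=len(batch))  # one sample per image
    return (1.0 - alpha) * batch + alpha * noise_pool[idx]

clean = np.ones((8, 32, 32))                  # toy batch of "clean" depth images
augmented = noisy_batch(clean, noise_pool, rng)
```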

RGB Network Training. Maximum likelihood learning; fine-tuned from pre-trained AlexNet weights.

Depth Network Training. Maximum likelihood learning; fine-tuned from pre-trained AlexNet weights.
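In both cases, maximum likelihood learning for a softmax classifier amounts to minimizing the negative log-likelihood (cross-entropy) of the true labels; a minimal numpy sketch with toy logits:

```python
import numpy as np

def nll_loss(logits, labels):
    """Mean negative log-likelihood of the correct classes (softmax cross-entropy)."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.1, -1.0], [0.0, 3.0, 0.5]])
loss = nll_loss(logits, np.array([0, 1]))
```

A sanity check: with uniform logits over 3 classes, the loss equals log 3.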

Fusion Network Training. The fusion layers automatically learn to combine the feature responses of the two network streams. During training, the weights in the first layers stay fixed.

UW RGB-D Object Dataset [Lai et al., 2011], category-level recognition accuracy [%] (51 categories):

Method               RGB    Depth   RGB-D
CNN-RNN              80.8   78.9    86.8
HMP                  82.4   81.2    87.5
CaRFs                N/A    N/A     88.1
CNN Features         83.1   N/A     89.4
Fus-CNN (this work)  84.1   83.8    91.3

Confusion Matrix. (Figure: labels vs. predictions; examples include mushroom, garlic, pitcher, coffee mug, and peach.)

Recognition in Noisy RGB-D Scenes. Recognition uses annotated bounding boxes; with noise adaptation the predictions are correct where the non-adapted model fails (e.g., bowl, cap, soda can, coffee mug, flashlight). (Table: category-level recognition accuracy [%] in the depth modality over 6 categories, including cereal box and coffee mug; noise adaptation raises the class average to 82.1%.)

Deep Learning for RGB-D Object Recognition. Novel RGB-D object recognition for robotics: a two-stream CNN with a late-fusion architecture, plus a depth image transfer and noise-augmentation training strategy. State of the art on the UW RGB-D Object dataset for category recognition (91.3%), and a recognition accuracy of 82.1% on the RGB-D Scenes dataset.

Deep Learning for Human Part Discovery in Images Gabriel L. Oliveira, Abhinav Valada, Claas Bollen, Wolfram Burgard, Thomas Brox [submitted to ICRA 2016]

Deep Learning for Human Part Discovery in Images Human-robot interaction Robot rescue

Deep Learning for Human Part Discovery in Images. Dense prediction provides per-pixel classification of the image. Human part segmentation is naturally challenging due to the non-rigid body and occlusions. Datasets: PASCAL Parts, MS COCO, Freiburg Sitting People.

Network Architecture. A fully convolutional network with contraction and expansion of the network input; the up-convolution operation performs the expansion. Pixel input, pixel output.
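A stride-2 up-convolution (transposed convolution) of the kind used for expansion can be sketched for a single channel as follows; the kernel size and weights are illustrative:

```python
import numpy as np

def upconv2x(x, kernel):
    """Stride-2 transposed convolution of a single-channel 2-D map."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros((2 * (h - 1) + kh, 2 * (w - 1) + kw))
    # Each input value spreads a copy of the kernel, placed stride-2 apart;
    # overlapping contributions sum, exactly as in a transposed convolution.
    for i in range(h):
        for j in range(w):
            out[2 * i:2 * i + kh, 2 * j:2 * j + kw] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = upconv2x(x, np.ones((2, 2)))   # 2x2 map -> 4x4 map
```

With an all-ones 2x2 kernel this reduces to block replication; a learned kernel instead produces a smooth, trainable upsampling.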

Experiments. Evaluation of the approach on publicly available computer vision datasets and on real-world datasets with ground and aerial robots; comparison against a state-of-the-art semantic segmentation approach, the FCN proposed by Long et al. [1]. [1] Jonathan Long, Evan Shelhamer, Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Data Augmentation. Due to the low number of images in the available datasets, augmentation is crucial: spatial augmentation (rotation + scaling) and color augmentation.
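A toy version of such an augmentation pipeline; flips, integer upscaling, and per-channel color jitter stand in for the rotation and scaling described above:

```python
import numpy as np

def augment(img, rng):
    """Return a randomly perturbed copy of an HxWx3 image with values in [0, 1]."""
    out = img
    if rng.random() < 0.5:                 # spatial: horizontal flip
        out = out[:, ::-1]
    if rng.random() < 0.5:                 # spatial: 2x nearest-neighbor upscale
        out = out.repeat(2, axis=0).repeat(2, axis=1)
    scale = rng.uniform(0.8, 1.2, size=3)  # color: per-channel brightness jitter
    return np.clip(out * scale, 0.0, 1.0)

rng = np.random.default_rng(7)
img = rng.random((16, 16, 3))
aug = augment(img, rng)
```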

PASCAL Parts Dataset. (Results: IOU on PASCAL Parts with 4 classes and with 14 classes.)

Freiburg Sitting People Part Segmentation Dataset. We present a novel dataset for part segmentation of people sitting in wheelchairs. (Figure: input image, ground truth, segmentation mask.)

Robot Experiments. Range experiments with a ground robot; an aerial platform for a disaster scenario; segmentation under severe body occlusions.

Range Experiments. Recorded using a Bumblebee camera; robust to radial distortion and to scale. (Figure: segmentations at (a) 1.0 m through (f) 6.0 m.)

Freiburg People in Disaster. A dataset designed to test severe occlusions. (Figure: input image, ground truth, segmentation mask.)

Future Work. Investigate the potential for human keypoint annotation; real-time part segmentation on small hardware; human part segmentation in videos.

Deep Feature Learning for Acoustics-based Terrain Classification. Abhinav Valada, Luciano Spinello, Wolfram Burgard [ISRR 2015]

Motivation Robots are increasingly being used in unstructured real-world environments

Motivation. Optical sensors are highly sensitive to visual changes: lighting variations, shadows, dirt on the lens.

Motivation Use sound from vehicle-terrain interactions to classify terrain

Network Architecture. A novel architecture designed for unstructured sound data; global pooling gathers statistics of the learned features across time.
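The global-pooling idea, summarizing each learned feature channel across the whole time axis so that clips of different lengths yield a fixed-size descriptor, can be sketched as:

```python
import numpy as np

def global_pool(features):
    """Gather statistics of each feature channel across time.
    `features` is a (time x channels) map; the output size is independent
    of the clip length."""
    return np.concatenate([
        features.mean(axis=0),   # average activation over time
        features.max(axis=0),    # strongest activation over time
        features.std(axis=0),    # variability over time
    ])

short_clip = np.random.default_rng(1).random((30, 8))   # 30 frames, 8 channels
long_clip = np.random.default_rng(2).random((120, 8))   # 120 frames, 8 channels
```

Both clips pool to the same 24-dimensional descriptor, which is what lets one classifier handle variable-length audio windows.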

Data Collection (P3-DX robot). Terrains: wood, linoleum, carpet, asphalt, mowed grass, grass, paving, cobblestone, offroad.

Results - Baseline Comparison (300 ms window). 99.41% accuracy using a 500 ms window, a 16.9% improvement over the previous state of the art. Baselines: [1] T. Giannakopoulos, K. Dimitrios, A. Andreas, and T. Sergios, SETN 2006. [2] M. C. Wellman, N. Srour, and D. B. Hillis, SPIE 1997. [3] J. Libby and A. Stentz, ICRA 2012. [4] D. Ellis, ISMIR 2007. [5] G. Tzanetakis and P. Cook, IEEE TASLP 2002. [6] V. Brijesh and M. Blumenstein, Pattern Recognition Technologies and Applications 2008.

Robustness to Noise Per-class Precision

Noise Adaptive Fine-Tuning Avg. accuracy of 99.57% on the base model

Real-World Stress Testing. Avg. accuracy of 98.54%. (Figure: true positives and false positives.)

Social Experiment: Can You Guess the Terrain? Avg. human performance = 24.66%; avg. network performance = 99.5%. Go to deepterrain.cs.uni-freiburg.de, listen to five sound clips of a robot traversing different terrains, and guess what terrain they are.

Conclusions. Classifies terrain using only sound; state-of-the-art performance in proprioceptive terrain classification; the new DCNN architecture outperforms traditional approaches; noise adaptation boosts performance; experiments with a low-quality microphone demonstrate robustness.

Overall Conclusions. Deep networks are a promising approach to solving complex perception problems in robotics. The key challenges are finding the proper architecture and using proper data augmentation strategies.