G-CNN: an Iterative Grid Based Object Detector

Similar documents
R-FCN: Object Detection with Really - Friggin Convolutional Networks

Yiqi Yan. May 10, 2017

Lecture 5: Object Detection

G-CNN: an Iterative Grid Based Object Detector

Object detection with CNNs

Object Detection Based on Deep Learning

Rich feature hierarchies for accurate object detection and semant

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Spatial Localization and Detection. Lecture 8-1

YOLO: You Only Look Once Unified Real-Time Object Detection. Presenter: Liyang Zhong Quan Zou

Object Detection on Self-Driving Cars in China. Lingyun Li

YOLO 9000 TAEWAN KIM

CS6501: Deep Learning for Visual Recognition. Object Detection I: RCNN, Fast-RCNN, Faster-RCNN

YOLO9000: Better, Faster, Stronger

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Project 3 Q&A. Jonathan Krause

AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

Modern Convolutional Object Detectors

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab.

arxiv: v1 [cs.cv] 4 Jun 2015

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

Future directions in computer vision. Larry Davis Computer Vision Laboratory University of Maryland College Park MD USA

Rich feature hierarchies for accurate object detection and semantic segmentation

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

Multi-View 3D Object Detection Network for Autonomous Driving

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

OBJECT DETECTION HYUNG IL KOO

Deep Learning for Object detection & localization

R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network

Semantic Segmentation

Mask R-CNN. By Kaiming He, Georgia Gkioxari, Piotr Dollar and Ross Girshick Presented By Aditya Sanghi

Detecting and Parsing of Visual Objects: Humans and Animals. Alan Yuille (UCLA)

arxiv: v1 [cs.cv] 26 Jun 2017

Fully Convolutional Networks for Semantic Segmentation

Instance-aware Semantic Segmentation via Multi-task Network Cascades

Cascade Region Regression for Robust Object Detection

Flow-Based Video Recognition

Computer Vision Lecture 16

Learning to Detect Activity in Untrimmed Video. Prof. Bernard Ghanem

MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou

CAD: Scale Invariant Framework for Real-Time Object Detection

PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL

Lecture 7: Semantic Segmentation

Final Report: Smart Trash Net: Waste Localization and Classification

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs

From 3D descriptors to monocular 6D pose: what have we learned?

Context Refinement for Object Detection

Rich feature hierarchies for accurate object detection and semantic segmentation

Constrained Convolutional Neural Networks for Weakly Supervised Segmentation. Deepak Pathak, Philipp Krähenbühl and Trevor Darrell

Unsupervised Deep Learning. James Hays slides from Carl Doersch and Richard Zhang

Object Detection with YOLO on Artwork Dataset

Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms

arxiv: v1 [cs.cv] 2 Sep 2018

Learning to Segment Object Candidates

Computer Vision Lecture 16

Computer Vision Lecture 16

Regionlet Object Detector with Hand-crafted and CNN Feature

Pseudo Mask Augmented Object Detection

Deep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia

Unified, real-time object detection

Scene Text Recognition for Augmented Reality. Sagar G V Adviser: Prof. Bharadwaj Amrutur Indian Institute Of Science

CS 1674: Intro to Computer Vision. Object Recognition. Prof. Adriana Kovashka University of Pittsburgh April 3, 5, 2018

arxiv: v1 [cs.cv] 9 Dec 2018

Pseudo Mask Augmented Object Detection

Multiple Instance Detection Network with Online Instance Classifier Refinement

Efficient Segmentation-Aided Text Detection For Intelligent Robots

Photo OCR ( )

Self Driving. DNN * * Reinforcement * Unsupervised *

Struck: Structured Output Tracking with Kernels. Presented by Mike Liu, Yuhang Ming, and Jing Wang May 24, 2017

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing

arxiv: v1 [cs.cv] 15 Oct 2018

Pixel Offset Regression (POR) for Single-shot Instance Segmentation

[Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors

DEEP NEURAL NETWORKS FOR OBJECT DETECTION

arxiv: v1 [cs.cv] 14 Jun 2016

Object Detection. TA : Young-geun Kim. Biostatistics Lab., Seoul National University. March-June, 2018

Overall Description. Goal: to improve spatial invariance to the input data. Translation, Rotation, Scale, Clutter, Elastic

Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network

Optimizing Object Detection:

Adaptive Object Detection Using Adjacency and Zoom Prediction

A Lightweight YOLOv2:

Xiaowei Hu* Lei Zhu* Chi-Wing Fu Jing Qin Pheng-Ann Heng

Real-Time Rotation-Invariant Face Detection with Progressive Calibration Networks

arxiv: v1 [cs.cv] 31 Mar 2016

Visual features detection based on deep neural network in autonomous driving tasks

segdeepm: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection

arxiv: v1 [cs.cv] 14 Dec 2015

arxiv: v2 [cs.cv] 8 Apr 2018

SSD: Single Shot MultiBox Detector

RON: Reverse Connection with Objectness Prior Networks for Object Detection

Martian lava field, NASA, Wikipedia

Human Pose Estimation with Deep Learning. Wei Yang

Learning Deep Structured Models for Semantic Segmentation. Guosheng Lin

Why Is Recognition Hard? Object Recognizer

Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture David Eigen, Rob Fergus

Transcription:

G-CNN: an Iterative Grid Based Object Detector Magyar Najibi Univ. Maryland Mohammad Rastegari Univ. Maryland Larry S. Davis Univ. Maryland CVPR, 2016 VGG Reading Group - Sam Albanie

Motivation: Proposals are Expensive Example: find the father of the internet Selective Search 2.24 seconds EdgeBoxes 0.38 seconds

Motivation: Proposals are Expensive Example: find the father of the internet Cheaper Alternative: grids

Motivation: Keep accuracy with iterations! Downside of using grids: loss of accuracy In G-CNN, high accuracy is achieved with grid proposals by using an iterative bounding box regression scheme Inspired by IEF - move the work into the regression space!

Some members of the postdeepluvian* object detection family tree ARXIV Apr, 2015 FAST R-CNN ARXIV Nov, 2015 ProNet CVPR 2016 ARXIV Nov, 2015 ARXIV Nov, 2013 ARXIV June, 2014 ICCV 2015 ARXIV June, 2015 LocNet R-CNN SPP-Net Faster RCNN CVPR 2016 ARXIV June, 2015 CVPR 2014 ECCV 2014 NIPS 2015 YOLO ARXIV June, 2015 CVPR 2016 *Serge Bolongieism R-CNN minus R BMVC 2015 Fully connected bidirectional inspiration layer ARXIV Dec, 2015 G-CNN CVPR 2016

Potentially interesting R-CNN authors investigated iterative procedure: At test time, we score each proposal and predict its new detection window only once. In principle, we could iterate this procedure (i.e. re-score the newly predicted bounding box and then predict a new bounding box from it, and so on). However, we found that iterating does not improve results APPENDIX C - Rich Feature Hierarchies for accurate object detection and semantic segmentation

Discussion

How does it work?

Bounding Box Regression In Object Detection: Recap Key idea: snakes are not the same shape as donkeys. (i.e. once you have predicted the object category, you should be able to improve your bounding box) Introduced in the DPM paper (geometric features) Revisited in the R-CNN paper (CNN features)

Bounding Box Regression In Object Detection: R-CNN style Training: The goal is to learn a mapping per category from a proposed box P to a ground truth box G. Inputs are N training pairs {(P i,g i )} i=1,...,n, where P i =(P i x,p i y,p i w,p i h) Parameterise mapping with linear functions such that: d x (P ),d y (P ),d w (P ),d h (P ) ˆ G x = P w d x (P )+P x Ĝ y = P h d y (P )+P y (scale invariant) Gw ˆ = P w exp(d w (P ))) Gˆ h = P h exp(d h (P ))) (log space) The functions are learned with ridge regression.

Training Architecture Notes: bounding box colours

Bounding Box Regression: The Nitty Gritty for G-CNN Training - each bounding box with IoU > 0.2 assigned to one of ground truth boxes in the same image, based on its initial grid position. The function is learned with piece-wise regression, using target boxes at step 1 < s < S_train

Bounding Box Regression: The Nitty Gritty for G-CNN Loss function: L({B i })= SX train s=1 NX [I(B 1 i /2 B BG ) L reg ( i,l s i (B s i, (B s i, A(B s i ),s)))] i=1 L_reg is the smooth L1 loss from Fast R-CNN:

Bounding Box Regression: The Nitty Gritty for G-CNN For efficiency during training, approximate predicted update with the perfect update

Optimisation SGD ftw. Note: sampling biases early iteration steps

Test-time architecture Comparison to R-CNN: N_proposal vs (S_test x N_grid)

Demo

Experiments

Config 2x2 5x5 10x10 Training overlaps: [0.9, 0.8, 0.7] Test overlaps: [0.7, 0.5, 0] Regression network is trained for S = 3 steps

Experiment 1:VOC 2007 Each network was based on Alexnet, trained on VOC 2007 trainval set and evaluated on the test set. FR-CNN := One step at test time + approx 2000 initial boxes (SS) G-CNN(3) := Three steps at test time + approx 1500 initial boxes G-CNN(5) := Five steps at test time + approx 180 initial boxes

Experiment 2:VOC 2007 Each network was based on VGG-16, trained on VOC 2007 trainval set and evaluated on the test set. Claim: G-CNN effectively moves small # of boxes to targets

Experiment 3:VOC 2012 Each network was based on VGG-16 with the following training 12 := VOC2012 trainval, 07+12 := VOC2007 trainval + VOC2012 trainval 07++12 := VOC2007 trainval/test + VOC2012 trainval Claim: G-CNN provides best map without a proposal stage

Experiment 4:VOC 2007 Each network was based on Alexnet, trained on VOC 2007 trainval set and evaluated on the test set, with five steps at test time IF-FRCNN := Apply FR-CNN iteratively 1Step-Grid := Train G-CNN with all tuples in one step Claim: Stepwise training matters

Analysis of Detection Results Claim: Removing proposal stage did not hurt localisation

Detection Run Time Benchmarks with two K40 GPUs with VGG16 Net Fast R-CNN: 0.5 fps G-CNN: 3 fps

Rough comparison with current state of the art (VOC 2007 test set) Different training sets give an idea of how well the model scales with additional data. Table compiled July 2016 Model Training Speed (juice) map G-CNN 07 3fps (2xK40) 66.8 Faster R-CNN 07+12 5fps (K40) 73.2 SSD-300 07+12 58fps (TITAN X) 72.1 SSD-500 07+12 23fps (TITAN X) 75.1 R-FCN 07+12 6fps (K40) 80.5 Faster R-CNN 07+12+CO 5fps (K40) 85.6 R-FCN 07+12+CO 6fps (K40) 83.6 NOTE: By the time you are reading this, it is probably out of date