R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS

R-FCN: Object Detection via Region-based Fully Convolutional Networks
Jifeng Dai (Microsoft Research), Yi Li (Tsinghua University), Kaiming He (Microsoft Research), Jian Sun (Microsoft Research)

Speed/Accuracy Trade-offs for Modern Convolutional Object Detectors
Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy

Deep Learning Seminar, Tel-Aviv University. Instructor: Dr. Raja Giryes. Presenters: Gilad Uziel, Netzah Calamaro

Introduction
There are two families of methods for object detection:
- region-based (two stages)
- single-shot (one stage)
R-FCN is a hybrid of both: it uses a Region Proposal Network (RPN), but works on the entire image simultaneously.

The Main Idea
[Figure: each RoI is divided into a k x k grid; grid cell (i, j) pools its score only from the (i, j)-th bank of the k x k position-sensitive score maps computed from the feature maps, and the pooled k x k scores are averaged into a single score per class.]
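To make the idea concrete, here is a minimal NumPy sketch of position-sensitive RoI pooling; the average pooling within bins and the average voting follow the paper, while the function name and all sizes are illustrative:

```python
import numpy as np

def psroi_pool(score_maps, roi, k, C):
    """Position-sensitive RoI pooling (minimal sketch).

    score_maps: (k*k*(C+1), H, W) -- one bank of C+1 maps per grid cell.
    roi: (x0, y0, x1, y1) in feature-map coordinates.
    Returns per-class scores of shape (C+1,) after average voting.
    """
    x0, y0, x1, y1 = roi
    bin_w = (x1 - x0) / k
    bin_h = (y1 - y0) / k
    pooled = np.zeros((C + 1, k, k))
    for i in range(k):          # grid row
        for j in range(k):      # grid column
            ys = int(np.floor(y0 + i * bin_h))
            ye = max(ys + 1, int(np.ceil(y0 + (i + 1) * bin_h)))
            xs = int(np.floor(x0 + j * bin_w))
            xe = max(xs + 1, int(np.ceil(x0 + (j + 1) * bin_w)))
            # bin (i, j) pools ONLY from its own bank of (C+1) channels
            bank = score_maps[(i * k + j) * (C + 1):(i * k + j + 1) * (C + 1)]
            pooled[:, i, j] = bank[:, ys:ye, xs:xe].mean(axis=(1, 2))
    return pooled.mean(axis=(1, 2))   # average voting over the k x k grid

# Example: k = 3 grid, C = 20 classes, 40 x 40 feature map
scores = psroi_pool(np.random.randn(3 * 3 * 21, 40, 40), (5, 5, 25, 25), k=3, C=20)
print(scores.shape)  # (21,)
```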

R-FCN Architecture
[Figure sequence, built up over six slides: backbone convolutional feature maps, the k²(C+1)-channel position-sensitive score maps, RPN proposals, and position-sensitive RoI pooling followed by voting.]

Bounding Box
Aside from the k²(C+1)-d convolutional layer, we append a sibling 4k²-d convolutional layer for bounding-box regression. Position-sensitive RoI pooling is performed on this bank of 4k² maps, producing a 4k²-d vector for each RoI. This is then aggregated into a 4-d vector by average voting. The 4-d vector parameterizes a bounding box as t = (t_x, t_y, t_w, t_h).
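For reference, a small sketch of decoding t = (t_x, t_y, t_w, t_h) back into a box, assuming the standard R-CNN box parameterization that the paper follows (names are illustrative):

```python
import numpy as np

def decode_box(t, roi):
    """Decode t = (tx, ty, tw, th) relative to an RoI (x, y, w, h),
    using the standard R-CNN parameterization:
      x = tx * w_r + x_r,  y = ty * h_r + y_r,
      w = w_r * exp(tw),   h = h_r * exp(th).
    """
    tx, ty, tw, th = t
    xr, yr, wr, hr = roi
    return (tx * wr + xr, ty * hr + yr, wr * np.exp(tw), hr * np.exp(th))

print(decode_box((0.1, -0.05, 0.2, 0.0), (50, 60, 100, 80)))
```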

Visualization
Visualization of R-FCN for the person category when an RoI does not correctly overlap the object (k x k = 3 x 3).

Visualization
Visualization of R-FCN for the person category when an RoI does correctly overlap the object (k x k = 3 x 3).

Loss Function
L(s, t_{x,y,w,h}) = L_cls(s_{c*}) + λ · [c* > 0] · L_reg(t, t*)
- L_cls(s_{c*}) is computed by the softmax (cross-entropy) function.
- L_reg(t, t*) is computed by the smooth-L1 function.
- [c* > 0] is an indicator that equals 1 if the argument is true and 0 otherwise.
- We set the balance weight λ = 1.
- c* is the RoI's ground-truth label (c* = 0 means background).
- t* is the RoI's ground-truth box.
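A minimal PyTorch sketch of this loss for a single RoI, using the built-in softmax cross-entropy and smooth-L1 losses (the function name and tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def rfcn_loss(class_scores, box_pred, c_star, t_star, lam=1.0):
    """L(s, t) = L_cls(s_{c*}) + lam * [c* > 0] * L_reg(t, t*)  (sketch).

    class_scores: (C+1,) raw scores for one RoI; box_pred, t_star: (4,).
    c_star: ground-truth label (0 = background).
    """
    l_cls = F.cross_entropy(class_scores.unsqueeze(0),
                            torch.tensor([c_star]))        # softmax loss
    l_reg = F.smooth_l1_loss(box_pred, t_star)             # smooth L1
    return l_cls + lam * (1.0 if c_star > 0 else 0.0) * l_reg

loss = rfcn_loss(torch.randn(21), torch.randn(4), c_star=3, t_star=torch.randn(4))
```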

Backpropagation
During training, we define positive examples as RoIs that have intersection-over-union (IoU) overlap with a ground-truth box of at least 0.5, and negative otherwise. With online hard example mining, backpropagation is performed on the B = 128 RoIs that have the highest loss (positive and negative) among the selected examples.
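A sketch of the hard-example selection step, assuming a per-RoI loss vector has already been computed (names are illustrative):

```python
import torch

def ohem_select(per_roi_loss, B=128):
    """Online hard example mining (sketch): keep the B RoIs with the
    highest loss and backpropagate only through them."""
    B = min(B, per_roi_loss.numel())
    hard_loss, idx = per_roi_loss.topk(B)
    return hard_loss.mean(), idx   # mean loss over the selected hard RoIs

losses = torch.rand(300)           # e.g. one loss value per proposal
mean_hard_loss, selected = ohem_select(losses, B=128)
```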

Backbone Architecture
This incarnation of R-FCN is based on ResNet-101. ResNet-101 has 100 convolutional layers followed by global average pooling and a 1000-class fc layer. We remove the average pooling layer and the fc layer and use only the convolutional layers to compute feature maps. The last convolutional block in ResNet-101 is 2048-d; we attach a randomly initialized 1024-d 1x1 convolutional layer to reduce the dimension. Then we apply the k²(C+1)-channel convolutional layer to generate the score maps.
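A sketch of this head in PyTorch; the two 1x1 layers follow the description above, while the ReLU between them is an assumption:

```python
import torch.nn as nn

C, k = 20, 7   # PASCAL VOC classes, 7 x 7 position grid

# ResNet-101 conv feature maps end at 2048 channels; reduce, then score.
rfcn_head = nn.Sequential(
    nn.Conv2d(2048, 1024, kernel_size=1),             # random-init 1x1 reduction
    nn.ReLU(inplace=True),                            # assumed nonlinearity
    nn.Conv2d(1024, k * k * (C + 1), kernel_size=1),  # position-sensitive score maps
)
```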

Results
- Number of proposals: 300
- k x k = 7 x 7
- 83.6% mAP on PASCAL VOC 2007
- 170 ms test time per image

Speed/accuracy trade-offs for modern convolutional object detectors
A comparative study of R-FCN, SSD and Faster R-CNN

Motivation of the 2nd paper
Most works discuss only accuracy; this work also focuses on memory/speed and the accuracy/speed/memory trade-off, i.e., selecting the right algorithm for a specific purpose and optimizing the parameters within that algorithm:
1. Mobile devices (cellular) require a low memory footprint.
2. Autonomous cars require real-time performance: speed and accuracy.
3. Server-side applications such as Google/Facebook require accuracy, but throughput is still a bottleneck.
4. Contests require accuracy.
Comparing apples to apples requires an objective, comprehensive test bench that can show the differences, so such a test bench needs to be developed.
Video demos: SSD + YOLO; SSD (Summit Bhala); Faster R-CNN (Kaiming He); R-FCN (Vatsal Srivastava).

Location of videos:
YOLO vs SSD: https://www.youtube.com/watch?v=8ql69caj2ku
R-FCN: https://www.youtube.com/watch?v=jljhxuzoeaq
Faster R-CNN: https://www.youtube.com/watch?v=wzmsmkk9vua

Motivation of the 2nd paper
There are sweet spots on the trade-off graph where investing a lot of GPU time yields only a small accuracy gain. This can also be read in reverse: one may invest much less GPU time at the cost of only a little accuracy.

Comparative architecture reminder
[Figure: architecture diagrams and sample results for SSD, Faster R-CNN and R-FCN.]

EXPERIMENTAL PLATFORM
Six feature extractors are used with all detectors:
- VGG-16
- ResNet-101
- Inception V2
- Inception V3
- Inception ResNet V2
- MobileNet

EXPERIMENTAL PLATFORM: ADDITIONAL DETAILS

Loss function configuration: what the loss function is; matching anchors to ground-truth instances (argmax vs. bipartite matching; a sketch follows below); box encoding as a w x h rectangle with an L1-norm location loss.
Input size configuration: variable-resolution input sizes assure a wide range.
Computation platform: Intel Xeon E5-1650 (6 cores) and an Nvidia GTX Titan GPU (about 4 times the compute of a home-gamer card).
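A sketch of the argmax matching strategy mentioned above (the threshold and names are illustrative; bipartite matching would instead greedily assign at most one anchor per ground-truth box):

```python
import numpy as np

def argmax_match(iou, pos_thresh=0.5):
    """iou: (num_anchors, num_gt) IoU matrix.
    Each anchor takes its best ground-truth box if IoU >= pos_thresh,
    otherwise it is labeled negative (-1)."""
    best_gt = iou.argmax(axis=1)                            # best box per anchor
    best_iou = iou[np.arange(iou.shape[0]), best_gt]
    return np.where(best_iou >= pos_thresh, best_gt, -1)

labels = argmax_match(np.random.rand(6, 3))
print(labels)   # e.g. [ 2 -1  0 ... ]
```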

Training and hyper-parameters
Training uses asynchronous SGD: when a worker finishes computing its mini-batch gradient, the gradient is applied to the shared parameters without waiting for the other workers, and the worker continues training. This may cause stale (delayed) gradients, but it is faster.
Results:
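A purely schematic sketch of one asynchronous SGD worker step (real implementations use a parameter server; everything here is illustrative):

```python
import numpy as np

def async_sgd_step(shared_params, grad_fn, lr=0.01):
    """One worker step of asynchronous SGD (schematic): the gradient is
    applied to the shared parameters as soon as it is ready, without
    waiting for other workers, so it may be computed from a stale copy."""
    stale_params = shared_params.copy()   # worker's possibly outdated snapshot
    grad = grad_fn(stale_params)          # gradient on the worker's mini-batch
    shared_params -= lr * grad            # applied immediately, no barrier
    return shared_params

w = np.zeros(4)
w = async_sgd_step(w, grad_fn=lambda p: 2 * p + 1)
```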

Mean Average Precision (mAP)
[Figure: mAP vs. GPU time (ms) for all detector/feature-extractor combinations.]

Mean Average Precision (mAP): conclusions
- R-FCN and SSD are faster than Faster R-CNN on average; Faster R-CNN is more accurate.
- Sweet spot: a point where obtaining a little more accuracy costs a lot of speed. Another way to view it: a little GPU time is invested without sacrificing too much accuracy.
- Larger feature extractors are slower.
(What is Inception? See the back-up slide.)

Mean Average Precision (mAP), colored by feature extractor
[Figure: mAP vs. GPU time (ms), colored by feature extractor.]

mAP colored by feature extractor: conclusions
- Larger feature extractors are slower.
- The colored clusters show the relation to the feature extractor: each meta-architecture (R-FCN, Faster R-CNN, SSD) was implemented with various feature extractors, which makes the test bench variable.
- MobileNet and Inception V2 are faster on average than Inception ResNet V2 because they are smaller feature extractors.
- Sweet spot: a point where obtaining a little more accuracy costs a lot of speed.

Memory vs. GPU time for different feature extractors
[Figure: memory consumption vs. GPU time (ms) per feature extractor.]

Memory vs. GPU time for different feature extractors: conclusions
- Larger feature extractors are both slower and demand more memory; it comes together: larger means more memory and usually more GPU time.
- Inception ResNet V2 is the most memory- and compute-demanding.
- MobileNet with SSD is the fastest, with minimal GPU/memory consumption.
- Sweet spots: R-FCN w/ ResNet-101, and Faster R-CNN w/ ResNet-101 with only 50 proposals. R-FCN w/ ResNet-101 runs at ~100 ms GPU time with high accuracy and not too high memory consumption.

mAP for each object size by meta-architecture and feature extractor
[Figure: accuracy bar graph per object size.]

mAP for each object size by meta-architecture and feature extractor
How to read the bar graph: it partitions each feature-extractor model by object size (small, medium, large); three meta-architectures are drawn per feature extractor.
- SSD has (very) poor performance on small objects, is competitive with Faster R-CNN and R-FCN on larger objects, and outperforms them when paired with lightweight feature extractors.
- For small objects, improved input resolution may compensate for object size in accuracy.

mAP on small objects vs. mAP on large objects, colored by input resolution
[Figure: scatter plot; SSD models highlighted.]

mAP on small objects vs. mAP on large objects, colored by input resolution
- High-resolution models give significantly better mAP on small objects (roughly 2x) and somewhat better results on large objects.
- In SSD, higher resolution improves large-object accuracy but is less successful at improving small-object accuracy.
- For R-FCN, Faster R-CNN and SSD: strong performance on small objects implies strong performance on large objects. The opposite does not hold: SSD performs well on large objects but poorly on small ones.

mAP vs. Top-1 accuracy of the feature extractor on ImageNet
[Figure: scatter plot.]

mAP vs. Top-1 accuracy of the feature extractor on ImageNet
- There is an overall correlation between classification accuracy (feature extraction) and detection accuracy (overall).
- This correlation appears to be more significant for Faster R-CNN and R-FCN.
- SSD's performance appears less reliant on its feature extractor's classification accuracy: SSD is unable to fully leverage the power of the ResNet and Inception ResNet feature extractors, but using cheaper feature extractors does not hurt SSD much. On large objects it is competitive with Faster R-CNN and R-FCN.

Effect of proposing fewer regions in (a) Faster R-CNN and (b) R-FCN on mAP (solid line) and GPU time (dashed line)
[Figure: mAP and GPU time as a function of the number of proposals, per feature extractor, with the feature extractor fixed.]

Effect of proposing fewer regions in (a) Faster R-CNN and (b) R-FCN on mAP (solid line) and GPU time (dashed line)
- Figure (a): for Faster R-CNN with the Inception ResNet feature extractor, 50 proposals obtain 96% of the accuracy of 300 proposals while reducing GPU runtime by a factor of 3.
- Figure (a): with Inception ResNet, which has 35.4% mAP at 300 proposals, accuracy stays similar (29% mAP) with only 10 proposals; the sweet spot is around 50 proposals.
- Figure (a): similar trade-offs hold for the other feature extractors, although less intensely.

Effect of proposing fewer regions in (a) Faster R-CNN and (b) R-FCN on mAP (solid line) and GPU time (dashed line)
- Figure (b): the savings from using fewer proposals in the R-FCN setting are minimal, since the box classifier (the expensive part) is run only once per image.
- Figure (b): at 100 proposals, the speed and accuracy of Faster R-CNN models with ResNet become comparable to those of equivalent R-FCN models using 300 proposals, in both mAP and GPU speed.
- Summary: Faster R-CNN shows a dramatic proposals-to-GPU-time effect and a less significant proposals-to-accuracy effect; R-FCN shows only a mild effect of the number of proposals on GPU time and accuracy.

State-of-the-art detection on the MS COCO dataset
(What is multi-crop inference? What are mAP and AR? See the back-up slides.)

Interpretation of Tables 2, 3, 4
- Facts: run on the COCO dataset; average precision is averaged over IoU thresholds from 50% to 95%.
- Table 3: the test is an ensemble of the 5 best-performing Faster R-CNN models with ResNet feature extractors.
- Table 2 results: the ensemble's average accuracy is 41.3% mAP, better than the previous result of 37.1%.
- Roughly 60% relative accuracy improvement for small objects over the previous result.

Thank you Questions?

Single-shot detectors (YOLO, SSD) modify the proposal generator to directly output class probabilities (instead of objectness).
1) No separate proposal generator as in R-CNN.
2) Direct link from the feature extractor to the detection generator.
Pros: very fast (suitable for mobile applications, autonomous vehicles).
Cons: not good at detecting smaller objects (YOLO), but using feature maps from different layers can help a lot (SSD).
(back)

Reminder: Faster R-CNN starts with a feature extractor, continues with a proposal generator, then a box classifier.
- Feature extractor: 5 convolutional layers.
- Proposal generator: inserted after conv5 of the feature extractor; output = bounding boxes and objectness.
- Box classifier: input = crops of conv5 from the bounding boxes, with RoI pooling to get fixed-size feature maps; passed through fc layers; output = class probabilities.
Pro: best-performing accuracy.
Con: GPU runtime depends on the number of proposals.
(back)

Translation variance in detection: we want the classification network to output the same thing if the cat moves from top left to bottom right (object classification), but the Region Proposal Network (object location) to output something different.
- The box classifier is given a crop of the last, position-sensitive convolutional layer instead of conv5, so computation per proposal is reduced.
- New position-sensitive score maps: shape = (k·k·(C+1), h, w), encoding position into the channel dimension.
- New position-sensitive RoI pooling: input = (k·k·(C+1), roi_h, roi_w); pooled = (C+1, k, k); output = (C+1). In other words, the top-left bin pools only from its own subset of filters (made concrete in the sketch below).
- Classifier: input = the pooled feature maps.
Pro: a variation of R-FCN (TA-FCN) is the best instance segmentation architecture.
Pro: fast and fairly accurate.
Con: less accurate than Faster R-CNN.
(back)
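To make these shapes concrete, here is a self-contained NumPy walkthrough mirroring the earlier pooling sketch (all sizes illustrative):

```python
import numpy as np

k, C, roi_h, roi_w = 3, 20, 9, 9
crop = np.random.randn(k * k * (C + 1), roi_h, roi_w)   # input: (k*k*(C+1), roi_h, roi_w)
banks = crop.reshape(k * k, C + 1, roi_h, roi_w)        # one (C+1)-channel bank per bin
bh, bw = roi_h // k, roi_w // k
pooled = np.empty((C + 1, k, k))                        # pooled: (C+1, k, k)
for i in range(k):
    for j in range(k):
        # bin (i, j) pools only from its own bank, over its own spatial block
        pooled[:, i, j] = banks[i * k + j, :,
                                i * bh:(i + 1) * bh,
                                j * bw:(j + 1) * bw].mean(axis=(1, 2))
out = pooled.mean(axis=(1, 2))                          # output: (C+1,)
print(pooled.shape, out.shape)                          # (21, 3, 3) (21,)
```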

What are mAP and AR?
mAP = mean average precision: the average precision (area under the precision-recall curve) per class, averaged over classes (and, for COCO, over IoU thresholds). AR = average recall, averaged over IoU thresholds and object sizes.
(back)

Multi-crop inference
A pooling strategy that crops different regions from the convolutional feature maps and applies max-pooling over them, at varying scales.
(back)

Loss function:
L(a, I; θ) = α · 1[a is a positive anchor] · ℓ_loc(φ(b_a; a) − f_loc(I; a, θ)) + β · ℓ_cls(y_a, f_cls(I; a, θ))
where:
- I is the image and θ are the model parameters;
- a is an anchor and b_a is the ground-truth box matched to it; φ(b_a; a) is the box encoding of box b_a with respect to anchor a;
- f_loc(I; a, θ) is the predicted box encoding and ℓ_loc is the location loss;
- y_a is the class label (y_a = 0 for a negative anchor), f_cls is the predicted class probability and ℓ_cls is the classification loss;
- α, β are the weights balancing the localization and classification losses.
(back)

What is Region-of-Interest (RoI) pooling?
Used, for example, to detect multiple cars and pedestrians in a single image. Its purpose is to perform max pooling on inputs of non-uniform sizes to obtain fixed-size feature maps (e.g. 7 x 7).
(back)
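A minimal NumPy sketch of RoI max pooling to a fixed 7 x 7 grid (names and binning details are illustrative):

```python
import numpy as np

def roi_max_pool(feat, roi, out=7):
    """Max-pool a variable-size RoI crop of a (C, H, W) feature map
    to a fixed out x out grid (sketch)."""
    C = feat.shape[0]
    x0, y0, x1, y1 = roi
    crop = feat[:, y0:y1, x0:x1]
    h, w = crop.shape[1:]
    # split the crop into an out x out grid of roughly equal bins
    ys = np.linspace(0, h, out + 1).astype(int)
    xs = np.linspace(0, w, out + 1).astype(int)
    pooled = np.empty((C, out, out))
    for i in range(out):
        for j in range(out):
            pooled[:, i, j] = crop[:, ys[i]:max(ys[i] + 1, ys[i + 1]),
                                      xs[j]:max(xs[j] + 1, xs[j + 1])].max(axis=(1, 2))
    return pooled

print(roi_max_pool(np.random.randn(256, 32, 32), (3, 4, 20, 28)).shape)  # (256, 7, 7)
```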

Inception module
By running layers in parallel (e.g. 1x1, 3x3 and 5x5 convolutions plus pooling) and combining (concatenating) their outputs, less computation is invested, while being roughly equivalent to using additional depth.
(back)
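A GoogLeNet-style sketch in PyTorch; the four parallel branches and channel counts mirror the classic inception-3a block, but are illustrative here:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions plus pooling, concatenated
    along the channel axis; 1x1 reductions keep the compute low."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1),   # 1x1 reduces cost
                                nn.Conv2d(96, 128, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 32, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

y = InceptionModule(192)(torch.randn(1, 192, 28, 28))
print(y.shape)   # torch.Size([1, 256, 28, 28])
```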